For a seminar during my Master's studies, I am currently occupied with Text Mining. Especially for that, I use a lot of CLOBs in an Oracle 11g2 database. Generally a CLOB is a Character Large Object and can store up to 8 terabytes of character data. The VARCHAR2 data type can just store up to 4000 characters. For a lot of applications 4000 characters is sufficient but for storing texts like a publication or a newspaper article it is not enough. Here a CLOB is required.
Handling a CLOB can be very difficult because of its size. You need to cut of single words or even sentences. This can be very easy, using regular expressions. The process to disaggregate a text is called tokenization. The single words cut off from the text are the tokens. The code below shohs a procedures I use for tokenizing a CLOB. It will be loaded from the table into variable
Handling a CLOB can be very difficult because of its size. You need to cut of single words or even sentences. This can be very easy, using regular expressions. The process to disaggregate a text is called tokenization. The single words cut off from the text are the tokens. The code below shohs a procedures I use for tokenizing a CLOB. It will be loaded from the table into variable
b_text
. The a loop is run where a single word l_word
will be cut off from the clob using a regular expression and the function REGEXP_SUBSTR
. The regular expression [A-Za-zÄÖÜäöüß\-]*
means that all the characters in the braces with an unknown count will be taken as result. Note that all other characters such as a blank will terminate the collection. So the term will collect only letters (including Umlauts) and the hyphen. After that the rest of b_text
will be (re-)assigned to b_text
. The Loop will run until the length of b_text
is 0 or the cut word is NULL.
PROCEDURE Disaggregate(p_id IN INTEGER)
IS
b_text CLOB;
l_word VARCHAR2(100);
BEGIN
BEGIN
SELECT message INTO b_text
FROM message_table
WHERE id = p_id;
EXCEPTION
WHEN no_data_found THEN
RAISE_APPLICATION_ERROR(-20001, 'Row with ID '|| p_mid ||' not found');
END;
-- tokenize text
WHILE (LENGTH(b_text) > 1)
LOOP
-- fetch a single token
l_word := REGEXP_SUBSTR(b_text, '[A-Za-zÄÖÜäöüß\-]*');
-- jump to end of loop if word is NULL
IF l_word IS NULL THEN
EXIT;
END IF;
-- do some more operations...
-- Assign rest text already here because at next step CONTINUE could be called...
b_text := TRIM(SUBSTR(b_text, LENGTH(l_word)+1));
END LOOP;
END;
IS
b_text CLOB;
l_word VARCHAR2(100);
BEGIN
BEGIN
SELECT message INTO b_text
FROM message_table
WHERE id = p_id;
EXCEPTION
WHEN no_data_found THEN
RAISE_APPLICATION_ERROR(-20001, 'Row with ID '|| p_mid ||' not found');
END;
-- tokenize text
WHILE (LENGTH(b_text) > 1)
LOOP
-- fetch a single token
l_word := REGEXP_SUBSTR(b_text, '[A-Za-zÄÖÜäöüß\-]*');
-- jump to end of loop if word is NULL
IF l_word IS NULL THEN
EXIT;
END IF;
-- do some more operations...
-- Assign rest text already here because at next step CONTINUE could be called...
b_text := TRIM(SUBSTR(b_text, LENGTH(l_word)+1));
END LOOP;
END;
Kommentare
Kommentar veröffentlichen