Disaggregating CLOBs in PL/SQL

For a seminar during my Master's studies, I am currently occupied with Text Mining. Especially for that, I use a lot of CLOBs in an Oracle 11g2 database. Generally a CLOB is a Character Large Object and can store up to 8 terabytes of character data. The VARCHAR2 data type can just store up to 4000 characters. For a lot of applications 4000 characters is sufficient but for storing texts like a publication or a newspaper article it is not enough. Here a CLOB is required.

Handling a CLOB can be very difficult because of its size. You need to cut of single words or even sentences. This can be very easy, using regular expressions. The process to disaggregate a text is called tokenization. The single words cut off from the text are the tokens. The code below shohs a procedures I use for tokenizing a CLOB. It will be loaded from the table into variable b_text. The a loop is run where a single word l_word will be cut off from the clob using a regular expression and the function REGEXP_SUBSTR. The regular expression [A-Za-zÄÖÜäöüß\-]* means that all the characters in the braces with an unknown count will be taken as result. Note that all other characters such as a blank will terminate the collection. So the term will collect only letters (including Umlauts) and the hyphen. After that the rest of b_text will be (re-)assigned to b_text. The Loop will run until the length of b_text is 0 or the cut word is NULL.

PROCEDURE Disaggregate(p_id IN INTEGER)
IS
b_text CLOB;
l_word VARCHAR2(100);
BEGIN
BEGIN
    SELECT message INTO b_text
    FROM message_table
    WHERE id = p_id;
EXCEPTION
    WHEN no_data_found THEN
      RAISE_APPLICATION_ERROR(-20001, 'Row with ID '|| p_mid ||' not found');
END;

-- tokenize text
WHILE (LENGTH(b_text) > 1)
LOOP
    -- fetch a single token
    l_word := REGEXP_SUBSTR(b_text, '[A-Za-zÄÖÜäöüß\-]*');

    -- jump to end of loop if word is NULL
    IF l_word IS NULL THEN
      EXIT;
    END IF;

    -- do some more operations...

    -- Assign rest text already here because at next step CONTINUE could be called...
    b_text := TRIM(SUBSTR(b_text, LENGTH(l_word)+1));
END LOOP;
END;

No Bytes

Dieses Blog durchsuchen

Disaggregating CLOBs in PL/SQL

Labels

Kommentare

Kommentar veröffentlichen

Beliebte Posts aus diesem Blog

Pi And More 11 - QMC5883 Magnetic Field Sensor Class

Lazarus IDE and TOracleConnection - A How-To

How to use TOracleConnection under Lazarus for Win64