Direkt zum Hauptbereich

Disaggregating CLOBs in PL/SQL

For a seminar during my Master's studies, I am currently occupied with Text Mining. Especially for that, I use a lot of CLOBs in an Oracle 11g2 database. Generally a CLOB is a Character Large Object and can store up to 8 terabytes of character data. The VARCHAR2 data type can just store up to 4000 characters. For a lot of applications 4000 characters is sufficient but for storing texts like a publication or a newspaper article it is not enough. Here a CLOB is required.

Handling a CLOB can be very difficult because of its size. You need to cut of single words or even sentences. This can be very easy, using regular expressions. The process to disaggregate a text is called tokenization. The single words cut off from the text are the tokens. The code below shohs a procedures I use for tokenizing a CLOB. It will be loaded from the table into variable b_text. The a loop is run where a single word l_word will be cut off from the clob using a regular expression and the function REGEXP_SUBSTR. The regular expression [A-Za-zÄÖÜäöüß\-]* means that all the characters in the braces with an unknown count will be taken as result. Note that all other characters such as a blank will terminate the collection. So the term will collect only letters (including Umlauts) and the hyphen. After that the rest of b_text will be (re-)assigned to b_text. The Loop will run until the length of b_text is 0 or the cut word is NULL.

PROCEDURE Disaggregate(p_id IN INTEGER)
IS
  b_text CLOB;
  l_word VARCHAR2(100);
BEGIN
  BEGIN
    SELECT message INTO b_text
    FROM message_table
    WHERE id = p_id;
  EXCEPTION
    WHEN no_data_found THEN
      RAISE_APPLICATION_ERROR(-20001, 'Row with ID '|| p_mid ||' not found');
  END;

  -- tokenize text
  WHILE (LENGTH(b_text) > 1)
  LOOP
    -- fetch a single token
    l_word := REGEXP_SUBSTR(b_text, '[A-Za-zÄÖÜäöüß\-]*');

    -- jump to end of loop if word is NULL
    IF l_word IS NULL THEN
      EXIT;
    END IF;

    -- do some more operations...

    -- Assign rest text already here because at next step CONTINUE could be called...
    b_text := TRIM(SUBSTR(b_text, LENGTH(l_word)+1));
  END LOOP;
END;

Kommentare

Beliebte Posts aus diesem Blog

Pi And More 11 - QMC5883 Magnetic Field Sensor Class

A little aside from the analytical topics of this blog, I also was occupied with a little ubiquitous computing project. It was about machine learning with a magnetic field sensor, the QMC5883. In the Arduino module GY-271, usually the chip HMC5883 is equipped. Unfortunately, in cheap modules from china, another chip is used: the QMC5883. And, as a matter of course, the software library used for the HMC5883 does not work with the QMC version, because the I2C adress and the usage is a little bit different. Another problem to me was, that I  didn't find any proper working source codes for that little magnetic field device, and so I had to debug a source code I found for Arduino at Github  (thanks to dthain ). Unfortunately it didn't work properly at this time, and to change it for the Raspberry Pi into Python. Below you can find the "driver" module for the GY-271 with the QMC5883 chip. Sorry for the bad documentation, but at least it will work on a Raspberry Pi 3. ...

Lazarus IDE and TOracleConnection - A How-To

Free programming IDEs are a great benefit for everybody who's interested in Programming and for little but ambitious companies. One of these free IDEs is the Lazarus IDE . It's a "clone" of the Delphi IDE by Embarcadero (originally by Borland). But actually Lazarus is much more than a clone: Using the Free Pascal-Compiler , it was platform-independent and cross-compiling since it was started. I am using Lazarus very often - especially for building GUIs easily because Java is still Stone-Age when a GUI is required (though there is a couple of GUI-building tools - they all are much less performant than Delphi / Lazarus). In defiance of all benefits of Lazarus there still is one Problem. Not all Components are designed for use on a 64 bit systems. Considering that 64 bit CPUs are common in ordinary PCs since at least 2008, this is very anpleasant. One of the components which will not be available on 64 bit installations is the TOracleConnection of Lazarus' SQLDB ...

How to use TOracleConnection under Lazarus for Win64

Lazarus Programmers have had no possibility to use TOracleConnection under 64 Bit Windows and Lazarus for years. Even if you tried to use the TOracleConnection with a correctly configured Oracle 11g client, you were not able to connect to the Oracle Database. The error message was always: ORA-12154: TNS:could not resolve the connect identifier specified Today I found a simple workaround to fix this problem. It seems like the OCI.DLL from Oracle Client 11g2 is buggy. All my attempts to find identify the error ended here. I could exclude problems with the TNS systems in Oracle - or the Free Pascal file oracleconnection.pp though the error messages suggestes those problems. After investigating the function calls with Process Monitor (Procmon) I found out, that even the file TNSNAMES.ORA was found and read correctly by the Lazarus Test applictaion. So trouble with files not found or wrong Registry keys could also be eliminated. Finally I installed the Oracle Instant Client 12.1c - aft...