No Bytes

Posts

Posts mit dem Label "Tokenize" werden angezeigt.

Disaggregating CLOBs in PL/SQL

For a seminar during my Master's studies, I am currently occupied with Text Mining. Especially for that, I use a lot of CLOBs in an Oracle 11g2 database. Generally a CLOB is a Character Large Object and can store up to 8 terabytes of character data. The VARCHAR2 data type can just store up to 4000 characters. For a lot of applications 4000 characters is sufficient but for storing texts like a publication or a newspaper article it is not enough. Here a CLOB is required. Handling a CLOB can be very difficult because of its size. You need to cut of single words or even sentences. This can be very easy, using regular expressions. The process to disaggregate a text is called tokenization. The single words cut off from the text are the tokens. The code below shohs a procedures I use for tokenizing a CLOB. It will be loaded from the table into variable b_text . The a loop is run where a single word l_word will be cut off from the clob using a regular expression and the function REGEXP_...