No Bytes

Posts

Es werden Posts vom November, 2013 angezeigt.

Accelerating nested tables by "Letter Vectors"

One major problem when using Text Mining in PL/SQL is, that sooner or later associative arrays or even nested tables are required to store large lists of words and their attributes. Those collections are very slow to iterate. Now you may think, that a hashed value of a word you need to find can be used as index, what would be much faster here. Well, it would be. But Oracle 11g2 contains no hash value function, that will return unique results on lists with 10.000 words. On several tests I got double values after ~86 words, independent of the size you specify at the function parameters. I tried ORA_HASH as well as the DBMS_CRYPTO package's HASH -function. It was the similar problem. Dependent of this problem and the long run times of the procedures mining the text, it was necessary to speed up the algorithm somehow. And the algorith was just iterating across a collection (nested table), comparing each entry with the wanted word, what was very slow. Here I got an idea. I used alrea

Disaggregating CLOBs in PL/SQL

For a seminar during my Master's studies, I am currently occupied with Text Mining. Especially for that, I use a lot of CLOBs in an Oracle 11g2 database. Generally a CLOB is a Character Large Object and can store up to 8 terabytes of character data. The VARCHAR2 data type can just store up to 4000 characters. For a lot of applications 4000 characters is sufficient but for storing texts like a publication or a newspaper article it is not enough. Here a CLOB is required. Handling a CLOB can be very difficult because of its size. You need to cut of single words or even sentences. This can be very easy, using regular expressions. The process to disaggregate a text is called tokenization. The single words cut off from the text are the tokens. The code below shohs a procedures I use for tokenizing a CLOB. It will be loaded from the table into variable b_text . The a loop is run where a single word l_word will be cut off from the clob using a regular expression and the function REGEXP_