Accelerating nested tables by "Letter Vectors"

One major problem when using Text Mining in PL/SQL is, that sooner or later associative arrays or even nested tables are required to store large lists of words and their attributes. Those collections are very slow to iterate. Now you may think, that a hashed value of a word you need to find can be used as index, what would be much faster here. Well, it would be. But Oracle 11g2 contains no hash value function, that will return unique results on lists with 10.000 words. On several tests I got double values after ~86 words, independent of the size you specify at the function parameters. I tried ORA_HASH as well as the DBMS_CRYPTO package's HASH-function. It was the similar problem.
Dependent of this problem and the long run times of the procedures mining the text, it was necessary to speed up the algorithm somehow. And the algorith was just iterating across a collection (nested table), comparing each entry with the wanted word, what was very slow. Here I got an idea. I used already large nested tables containing lots of words loaded from database. I thought about storing the offset of the initial letter of words in this nested table in an array. This array of INTEGER has a length of 26. Each field of the array (1-26) will contain the letter's offset in the nested table, what means that it has to be ordered alphabetically and the ranges of the initial letter can be stored when filling it. For example: Words starting with the letter A will start at offset 1 due to the alphabetical order and end at offset n. B will start at n+1 and end at m. So if we later have a word, starting with the initial B, we know, that we just need to search the nested table from offset of B to offset of C - 1.

The definition of a Letter Vector

The Letter Vector is an ordinary array, storing an integer value. The definition in Oracle will look like the following pattern:

TYPE letter_vector IS VARRAY(26) OF INTEGER;

Filling the array

When filling the nested table, we need to look for the change of the initial letter of the currently inserted word. If it differs from the stored value, we know that here a initial letter change occurs and we can store the new position for the new initial letter in the second associative array. After that, we store the new initial letter to compare in the loop.
The code below shows procedure LoadWordsList() which will load the froms from table, fill them into the first associative array and detecting the letter vectors. Especially the last part is interesting. Block 1 shows how a n initial letter change will be detected. At start of the procedure, the Letter Vector array is empty, its length is 0. Then the first word will be handled. It will be converted into upper case after cutting of the first letter of the word. Then the letter's ASCII value is taken to substract 64 (because the ASCII upper case letter A has a value of 65). So in case of 'A' the value calculated will be 1. Because our l_letter is larger than the Letter Vector array, the IF-statement will be executed.
In the IF-statement will a while-loop be handled. It will run as long as the Letter Vector will in its size be equal to l_letter. This is because some letters maybe don't have any words assigned, they start with. Just think about X...but this is dependent of the scope of your dictionary. When jumping over a letter like X, it's vector will not be empty - it will contain the same value like the next letter, matching a position. But it's important to fill the letter vector array completely. It must be at a length of 26 at the end of the loop. If it isn't, and you will try to look with the offset of letter Z (it will be 26) and your array is too short, you will get an Index-Out-Of-Bounds-exception at run time.

PROCEDURE LoadWordsList
IS
i INTEGER;
l_Vec letter_vector;
l_letter INTEGER;
CURSOR c_word IS
    SELECT word FROM dictionary
    ORDER BY word;
BEGIN
BEGIN
    i := 1;
    FOR c IN c_word
    LOOP
-- block 1: code for detecting a change of the initial letter
l_letter := ASCII(UPPER(SUBSTR(c.word, 1, 1))) - 64;

      IF (l_letter > l_Vec.COUNT) THEN
        WHILE (l_letter > pVec.COUNT)
        LOOP
           l_Vec.EXTEND();
           l_Vec(pVec.COUNT) := i;
          END LOOP;
      END IF;

      -- normal code for filling the word list array
      l_list.EXTEND();
      l_list(i) := c.word;
      i := i + 1;
...
    END LOOP;
EXCEPTION WHEN NO_DATA_FOUND THEN
    RAISE_APPLICATION_ERROR(-200001,'DIctionary does not contain any words');
END;
END;

Faster iterations

Until now, we have done two things: declared a data type as Letter Vector and filled the Letter Vector array when filling the (alphabetically ordered) nested table. Now we just need to speed up iterating the thested table. When trying to find a word in it, there was no other way than iterating the whole nested table (in worst case). If the word is not contained, the whole array would be iterated.
But now, we can use the initial letter of the word to find, for identifying the range of the nested table containing only word starting with the same initial letter - because we have a vector pointing to it, a "Letter Vector".

PROCEDURE IterateFaster(pWord IN CHAR,)
IS
i INTEGER;
l_letter INTEGER;
l_start INTEGER; l_end INTEGER;
BEGIN
-- fetch the initial letter of the word to find and convert (-64)
l_letter := ASCII(UPPER(SUBSTR(c.word, 1, 1))) - 64;

-- assigning start and stop offset by the Letter Vector
l_start := l_Vec(l_letter);
l_end := l_Vec(MIN(26, l_letter + 1));

i := l_start;
LOOP
    EXIT WHEN (i IS NULL) OR (i > l_end);
    IF (pList(i).word = pWord) THEN
      ...
    END IF;

    i := pList.NEXT(i);
END LOOP;
END;

Some mathematics

But how much is ist faster by now, using a Letter Vector? This is a relatively simple calcuation. In an ordinary nested table, the steps of iteration are the average of the first entry in the nested table and the last one. So the avergare fields to iterate are:

(n+1)/2

Now we divide n into 26 ranges and have always just to iterate the range, which is a 26th part of n. Our average iterations will become:

(n+1)/2/26

And this is actually:

(n+1)/52

Now the algorithm will run almost 26 times faster (we lose some time by looking at the Letter Vector).
The previous formula implies that the distribution of the words, starting with each of the letters in the alphabet is normally distributed. But actually they are not because there are less worts starting with X or Q than those starting with an E. But since words starting with Q are more seldom than those starting with E, they appear more seldom. Further ther range of the nested table will be much smaller and faster to iterate. So the formula is still incorrect. I'll try to take a closer look at the distribution of the initial letters in my dictionary to correct that formula.

How to use TOracleConnection under Lazarus for Win64

Lazarus Programmers have had no possibility to use TOracleConnection under 64 Bit Windows and Lazarus for years. Even if you tried to use the TOracleConnection with a correctly configured Oracle 11g client, you were not able to connect to the Oracle Database. The error message was always: ORA-12154: TNS:could not resolve the connect identifier specified Today I found a simple workaround to fix this problem. It seems like the OCI.DLL from Oracle Client 11g2 is buggy. All my attempts to find identify the error ended here. I could exclude problems with the TNS systems in Oracle - or the Free Pascal file oracleconnection.pp though the error messages suggestes those problems. After investigating the function calls with Process Monitor (Procmon) I found out, that even the file TNSNAMES.ORA was found and read correctly by the Lazarus Test applictaion. So trouble with files not found or wrong Registry keys could also be eliminated. Finally I installed the Oracle Instant Client 12.1c - aft...

No Bytes

Dieses Blog durchsuchen