Direkt zum Hauptbereich

Accelerating nested tables by "Letter Vectors"

One major problem when using Text Mining in PL/SQL is, that sooner or later associative arrays or even nested tables are required to store large lists of words and their attributes. Those collections are very slow to iterate. Now you may think, that a hashed value of a word you need to find can be used as index, what would be much faster here. Well, it would be. But Oracle 11g2 contains no hash value function, that will return unique results on lists with 10.000 words. On several tests I got double values after ~86 words, independent of the size you specify at the function parameters. I tried ORA_HASH as well as the DBMS_CRYPTO package's HASH-function. It was the similar problem.
Dependent of this problem and the long run times of the procedures mining the text, it was necessary to speed up the algorithm somehow. And the algorith was just iterating across a collection (nested table), comparing each entry with the wanted word, what was very slow. Here I got an idea. I used already large nested tables containing lots of words loaded from database. I thought about storing the offset of the initial letter of words in this nested table in an array. This array of INTEGER has a length of 26. Each field of the array (1-26) will contain the letter's offset in the nested table, what means that it has to be ordered alphabetically and the ranges of the initial letter can be stored when filling it. For example: Words starting with the letter A will start at offset 1 due to the alphabetical order and end at offset n. B will start at n+1 and end at m. So if we later have a word, starting with the initial B, we know, that we just need to search the nested table from offset of B to offset of C - 1.

The definition of a Letter Vector

The Letter Vector is an ordinary array, storing an integer value. The definition in Oracle will look like the following pattern:

TYPE letter_vector IS VARRAY(26) OF INTEGER;

Filling the array

When filling the nested table, we need to look for the change of the initial letter of the currently inserted word. If it differs from the stored value, we know that here a initial letter change occurs and we can store the new position for the new initial letter in the second associative array. After that, we store the new initial letter to compare in the loop.
The code below shows procedure LoadWordsList() which will load the froms from table, fill them into the first associative array and detecting the letter vectors. Especially the last part is interesting. Block 1 shows how a n initial letter change will be detected. At start of the procedure, the Letter Vector array is empty, its length is 0. Then the first word will be handled. It will be converted into upper case after cutting of the first letter of the word. Then the letter's ASCII value is taken to substract 64 (because the ASCII upper case letter A has a value of 65). So in case of 'A' the value calculated will be 1. Because our l_letter is larger than the Letter Vector array, the IF-statement will be executed.
In the IF-statement will a while-loop be handled. It will run as long as the Letter Vector will in its size be equal to l_letter. This is because some letters maybe don't have any words assigned, they start with. Just think about X...but this is dependent of the scope of your dictionary. When jumping over a letter like X, it's vector will not be empty - it will contain the same value like the next letter, matching a position. But it's important to fill the letter vector array completely. It must be at a length of 26 at the end of the loop. If it isn't, and you will try to look with the offset of letter Z (it will be 26) and your array is too short, you will get an Index-Out-Of-Bounds-exception at run time.

PROCEDURE LoadWordsList
IS
  i INTEGER;
  l_Vec letter_vector;
  l_letter INTEGER;
  CURSOR c_word IS
    SELECT word FROM dictionary
    ORDER BY word;
BEGIN
  BEGIN
    i := 1;
    FOR c IN c_word
    LOOP
      -- block 1: code for detecting a change of the initial letter
      l_letter := ASCII(UPPER(SUBSTR(c.word, 1, 1))) - 64;

      IF (l_letter >  l_Vec.COUNT) THEN                                  
        WHILE (l_letter > pVec.COUNT)
        LOOP
           l_Vec.EXTEND();
           l_Vec(pVec.COUNT) := i;
          END LOOP;
      END IF;

      -- normal code for filling the word list array
      l_list.EXTEND();
      l_list(i) := c.word;
      i := i + 1;
      ...
    END LOOP;
  EXCEPTION WHEN NO_DATA_FOUND THEN
    RAISE_APPLICATION_ERROR(-200001,'DIctionary does not contain any words');
  END;       
END; 

Faster iterations

Until now, we have done two things: declared a data type as Letter Vector and filled the Letter Vector array when filling the (alphabetically ordered) nested table. Now we just need to speed up iterating the thested table. When trying to find a word in it, there was no other way than iterating the whole nested table (in worst case). If the word is not contained, the whole array would be iterated.
But now, we can use the initial letter of the word to find, for identifying the range of the nested table containing only word starting with the same initial letter - because we have a vector pointing to it, a "Letter Vector".

PROCEDURE IterateFaster(pWord IN CHAR,)
IS
  i INTEGER;
  l_letter INTEGER;
  l_start INTEGER;   l_end INTEGER;
BEGIN
  -- fetch the initial letter of the word to find and convert (-64)
  l_letter := ASCII(UPPER(SUBSTR(c.word, 1, 1))) - 64;

  -- assigning start and stop offset by the Letter Vector
  l_start := l_Vec(l_letter);
  l_end := l_Vec(MIN(26, l_letter + 1));

  i := l_start;
  LOOP
    EXIT WHEN (i IS NULL) OR (i > l_end);
    IF (pList(i).word = pWord) THEN
      ...
    END IF;
          
    i := pList.NEXT(i);
  END LOOP;
END;

Some mathematics

But how much is ist faster by now, using a Letter Vector? This is a relatively simple calcuation. In an ordinary nested table, the steps of iteration are the average of the first entry in the nested table and the last one. So the avergare fields to iterate are:
(n+1)/2
Now we divide n into 26 ranges and have always just to iterate the range, which is a 26th part of n. Our average iterations will become:
(n+1)/2/26
And this is actually:
(n+1)/52
Now the algorithm will run almost 26 times faster (we lose some time by looking at the Letter Vector).
The previous formula implies that the distribution of the words, starting with each of the letters in the alphabet is normally distributed. But actually they are not because there are less worts starting with X or Q than those starting with an E. But since words starting with Q are more seldom than those starting with E, they appear more seldom. Further ther range of the nested table will be much smaller and faster to iterate. So the formula is still incorrect. I'll try to take a closer look at the distribution of the initial letters in my dictionary to correct that formula.

Kommentare

Beliebte Posts aus diesem Blog

Pi And More 11 - QMC5883 Magnetic Field Sensor Class

A little aside from the analytical topics of this blog, I also was occupied with a little ubiquitous computing project. It was about machine learning with a magnetic field sensor, the QMC5883. In the Arduino module GY-271, usually the chip HMC5883 is equipped. Unfortunately, in cheap modules from china, another chip is used: the QMC5883. And, as a matter of course, the software library used for the HMC5883 does not work with the QMC version, because the I2C adress and the usage is a little bit different. Another problem to me was, that I  didn't find any proper working source codes for that little magnetic field device, and so I had to debug a source code I found for Arduino at Github  (thanks to dthain ). Unfortunately it didn't work properly at this time, and to change it for the Raspberry Pi into Python. Below you can find the "driver" module for the GY-271 with the QMC5883 chip. Sorry for the bad documentation, but at least it will work on a Raspberry Pi 3. ...

How to use TOracleConnection under Lazarus for Win64

Lazarus Programmers have had no possibility to use TOracleConnection under 64 Bit Windows and Lazarus for years. Even if you tried to use the TOracleConnection with a correctly configured Oracle 11g client, you were not able to connect to the Oracle Database. The error message was always: ORA-12154: TNS:could not resolve the connect identifier specified Today I found a simple workaround to fix this problem. It seems like the OCI.DLL from Oracle Client 11g2 is buggy. All my attempts to find identify the error ended here. I could exclude problems with the TNS systems in Oracle - or the Free Pascal file oracleconnection.pp though the error messages suggestes those problems. After investigating the function calls with Process Monitor (Procmon) I found out, that even the file TNSNAMES.ORA was found and read correctly by the Lazarus Test applictaion. So trouble with files not found or wrong Registry keys could also be eliminated. Finally I installed the Oracle Instant Client 12.1c - aft...

Lazarus IDE and TOracleConnection - A How-To

Free programming IDEs are a great benefit for everybody who's interested in Programming and for little but ambitious companies. One of these free IDEs is the Lazarus IDE . It's a "clone" of the Delphi IDE by Embarcadero (originally by Borland). But actually Lazarus is much more than a clone: Using the Free Pascal-Compiler , it was platform-independent and cross-compiling since it was started. I am using Lazarus very often - especially for building GUIs easily because Java is still Stone-Age when a GUI is required (though there is a couple of GUI-building tools - they all are much less performant than Delphi / Lazarus). In defiance of all benefits of Lazarus there still is one Problem. Not all Components are designed for use on a 64 bit systems. Considering that 64 bit CPUs are common in ordinary PCs since at least 2008, this is very anpleasant. One of the components which will not be available on 64 bit installations is the TOracleConnection of Lazarus' SQLDB ...