One major problem when using Text Mining in PL/SQL is, that sooner or later associative arrays or even nested tables are required to store large lists of words and their attributes. Those collections are very slow to iterate. Now you may think, that a hashed value of a word you need to find can be used as index, what would be much faster here. Well, it would be. But Oracle 11g2 contains no hash value function, that will return unique results on lists with 10.000 words. On several tests I got double values after ~86 words, independent of the size you specify at the function parameters. I tried
Dependent of this problem and the long run times of the procedures mining the text, it was necessary to speed up the algorithm somehow. And the algorith was just iterating across a collection (nested table), comparing each entry with the wanted word, what was very slow. Here I got an idea. I used already large nested tables containing lots of words loaded from database. I thought about storing the offset of the initial letter of words in this nested table in an array. This array of
The code below shows procedure LoadWordsList() which will load the froms from table, fill them into the first associative array and detecting the letter vectors. Especially the last part is interesting. Block 1 shows how a n initial letter change will be detected. At start of the procedure, the Letter Vector array is empty, its length is 0. Then the first word will be handled. It will be converted into upper case after cutting of the first letter of the word. Then the letter's ASCII value is taken to substract 64 (because the ASCII upper case letter A has a value of 65). So in case of 'A' the value calculated will be 1. Because our l_letter is larger than the Letter Vector array, the IF-statement will be executed.
In the IF-statement will a while-loop be handled. It will run as long as the Letter Vector will in its size be equal to l_letter. This is because some letters maybe don't have any words assigned, they start with. Just think about X...but this is dependent of the scope of your dictionary. When jumping over a letter like X, it's vector will not be empty - it will contain the same value like the next letter, matching a position. But it's important to fill the letter vector array completely. It must be at a length of 26 at the end of the loop. If it isn't, and you will try to look with the offset of letter Z (it will be 26) and your array is too short, you will get an Index-Out-Of-Bounds-exception at run time.
But now, we can use the initial letter of the word to find, for identifying the range of the nested table containing only word starting with the same initial letter - because we have a vector pointing to it, a "Letter Vector".
The previous formula implies that the distribution of the words, starting with each of the letters in the alphabet is normally distributed. But actually they are not because there are less worts starting with X or Q than those starting with an E. But since words starting with Q are more seldom than those starting with E, they appear more seldom. Further ther range of the nested table will be much smaller and faster to iterate. So the formula is still incorrect. I'll try to take a closer look at the distribution of the initial letters in my dictionary to correct that formula.
ORA_HASH
as well as the DBMS_CRYPTO
package's HASH
-function. It was the similar problem.Dependent of this problem and the long run times of the procedures mining the text, it was necessary to speed up the algorithm somehow. And the algorith was just iterating across a collection (nested table), comparing each entry with the wanted word, what was very slow. Here I got an idea. I used already large nested tables containing lots of words loaded from database. I thought about storing the offset of the initial letter of words in this nested table in an array. This array of
INTEGER
has a length of 26. Each field of the array (1-26) will contain the letter's offset in the nested table, what means that it has to be ordered alphabetically and the ranges of the initial letter can be stored when filling it. For example: Words starting with the letter A will start at offset 1 due to the alphabetical order and end at offset n. B will start at n+1 and end at m. So if we later have a word, starting with the initial B, we know, that we just need to search the nested table from offset of B to offset of C - 1.The definition of a Letter Vector
The Letter Vector is an ordinary array, storing an integer value. The definition in Oracle will look like the following pattern:
TYPE letter_vector IS VARRAY(26) OF INTEGER;
Filling the array
When filling the nested table, we need to look for the change of the initial letter of the currently inserted word. If it differs from the stored value, we know that here a initial letter change occurs and we can store the new position for the new initial letter in the second associative array. After that, we store the new initial letter to compare in the loop.The code below shows procedure LoadWordsList() which will load the froms from table, fill them into the first associative array and detecting the letter vectors. Especially the last part is interesting. Block 1 shows how a n initial letter change will be detected. At start of the procedure, the Letter Vector array is empty, its length is 0. Then the first word will be handled. It will be converted into upper case after cutting of the first letter of the word. Then the letter's ASCII value is taken to substract 64 (because the ASCII upper case letter A has a value of 65). So in case of 'A' the value calculated will be 1. Because our l_letter is larger than the Letter Vector array, the IF-statement will be executed.
In the IF-statement will a while-loop be handled. It will run as long as the Letter Vector will in its size be equal to l_letter. This is because some letters maybe don't have any words assigned, they start with. Just think about X...but this is dependent of the scope of your dictionary. When jumping over a letter like X, it's vector will not be empty - it will contain the same value like the next letter, matching a position. But it's important to fill the letter vector array completely. It must be at a length of 26 at the end of the loop. If it isn't, and you will try to look with the offset of letter Z (it will be 26) and your array is too short, you will get an Index-Out-Of-Bounds-exception at run time.
PROCEDURE LoadWordsList
IS
i INTEGER;
l_Vec letter_vector;
l_letter INTEGER;
CURSOR c_word IS
SELECT word FROM dictionary
ORDER BY word;
BEGIN
BEGIN
i := 1;
FOR c IN c_word
LOOP
-- block 1: code for detecting a change of the initial letter
l_letter := ASCII(UPPER(SUBSTR(c.word, 1, 1))) - 64;
IF (l_letter > l_Vec.COUNT) THEN
WHILE (l_letter > pVec.COUNT)
LOOP
l_Vec.EXTEND();
l_Vec(pVec.COUNT) := i;
END LOOP;
END IF;
-- normal code for filling the word list array
l_list.EXTEND();
l_list(i) := c.word;
i := i + 1;
...
END LOOP;
EXCEPTION WHEN NO_DATA_FOUND THEN
RAISE_APPLICATION_ERROR(-200001,'DIctionary does not contain any words');
END;
END;
IS
i INTEGER;
l_Vec letter_vector;
l_letter INTEGER;
CURSOR c_word IS
SELECT word FROM dictionary
ORDER BY word;
BEGIN
BEGIN
i := 1;
FOR c IN c_word
LOOP
-- block 1: code for detecting a change of the initial letter
l_letter := ASCII(UPPER(SUBSTR(c.word, 1, 1))) - 64;
IF (l_letter > l_Vec.COUNT) THEN
WHILE (l_letter > pVec.COUNT)
LOOP
l_Vec.EXTEND();
l_Vec(pVec.COUNT) := i;
END LOOP;
END IF;
-- normal code for filling the word list array
l_list.EXTEND();
l_list(i) := c.word;
i := i + 1;
...
END LOOP;
EXCEPTION WHEN NO_DATA_FOUND THEN
RAISE_APPLICATION_ERROR(-200001,'DIctionary does not contain any words');
END;
END;
Faster iterations
Until now, we have done two things: declared a data type as Letter Vector and filled the Letter Vector array when filling the (alphabetically ordered) nested table. Now we just need to speed up iterating the thested table. When trying to find a word in it, there was no other way than iterating the whole nested table (in worst case). If the word is not contained, the whole array would be iterated.But now, we can use the initial letter of the word to find, for identifying the range of the nested table containing only word starting with the same initial letter - because we have a vector pointing to it, a "Letter Vector".
PROCEDURE IterateFaster(pWord IN CHAR,)
IS
i INTEGER;
l_letter INTEGER;
l_start INTEGER; l_end INTEGER;
BEGIN
-- fetch the initial letter of the word to find and convert (-64)
l_letter := ASCII(UPPER(SUBSTR(c.word, 1, 1))) - 64;
-- assigning start and stop offset by the Letter Vector
l_start := l_Vec(l_letter);
l_end := l_Vec(MIN(26, l_letter + 1));
i := l_start;
LOOP
EXIT WHEN (i IS NULL) OR (i > l_end);
IF (pList(i).word = pWord) THEN
...
END IF;
i := pList.NEXT(i);
END LOOP;
END;
IS
i INTEGER;
l_letter INTEGER;
l_start INTEGER; l_end INTEGER;
BEGIN
-- fetch the initial letter of the word to find and convert (-64)
l_letter := ASCII(UPPER(SUBSTR(c.word, 1, 1))) - 64;
-- assigning start and stop offset by the Letter Vector
l_start := l_Vec(l_letter);
l_end := l_Vec(MIN(26, l_letter + 1));
i := l_start;
LOOP
EXIT WHEN (i IS NULL) OR (i > l_end);
IF (pList(i).word = pWord) THEN
...
END IF;
i := pList.NEXT(i);
END LOOP;
END;
Some mathematics
But how much is ist faster by now, using a Letter Vector? This is a relatively simple calcuation. In an ordinary nested table, the steps of iteration are the average of the first entry in the nested table and the last one. So the avergare fields to iterate are:
(n+1)/2
Now we divide n into 26 ranges and have always just to iterate the range, which is a 26th part of n. Our average iterations will become:
(n+1)/2/26
And this is actually:
(n+1)/52
Now the algorithm will run almost 26 times faster (we lose some time by looking at the Letter Vector).The previous formula implies that the distribution of the words, starting with each of the letters in the alphabet is normally distributed. But actually they are not because there are less worts starting with X or Q than those starting with an E. But since words starting with Q are more seldom than those starting with E, they appear more seldom. Further ther range of the nested table will be much smaller and faster to iterate. So the formula is still incorrect. I'll try to take a closer look at the distribution of the initial letters in my dictionary to correct that formula.
Kommentare
Kommentar veröffentlichen