Posts

Text Mining 1 - The problem of propositional logic and natural languages

Natural Language Processing (NLP) is becoming more and more important. It is used to determine the meaning of text written by human beings, and that is exactly why it matters: every day millions of people write. They write comments on products they ordered, they write comments on Facebook, they write their blogs. Of course the industry wants their opinions about products, songs, ideas and everything else - just look at Facebook. Opinions are money. There is (at least) one problem: how do you capture this money in the form of opinions? That is where Natural Language Processing comes in. The first thing needed to process a (natural) language is a parser. Most parsers process synthetic languages such as programming languages, which have a defined syntax and logical semantics. If the author of source code disregards the defined syntax or semantics, the parser will abort processing the code and throw an error. So the author of the code has to keep ...
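The strict behaviour of a parser for a synthetic language can be seen with Python's own parser. This is only a small illustration, not part of the original post; the helper name try_parse is made up for this sketch.

```python
import ast


def try_parse(source: str) -> str:
    """Report whether Python's parser accepts the given source text."""
    try:
        ast.parse(source)  # builds a syntax tree or raises SyntaxError
        return "ok"
    except SyntaxError:
        # The parser aborts at the first violation of the grammar.
        return "rejected"


print(try_parse("total = 1 + 2"))  # well-formed statement
print(try_parse("total = 1 +"))    # incomplete expression
```

A natural-language parser cannot simply abort like this, because human writers break the rules all the time and still expect to be understood.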

Eisberg FileSync

Eisberg FileSync 2.1 BETA online. I have just finished a new version of "Eisberg FileSync". Until further tests are done, it remains a BETA version and is only for 64-bit systems. New features: Multi-Sync in Arctos Traybar; Fast Sync (synchronize fast without creating any profiles). Download links: 64 Bit application: Eisberg FileSync 2.1 x64 BETA; 32 Bit application: Eisberg FileSync 2.1 x86 BETA.

Using ORE 1.4 on Oracle 12.1c with pluggable databases

It is possible to use Oracle database tables in the R statistical software, and this is a very useful approach (if you know R's capabilities). The fact that you are on this blog now may mean that you had no success trying to use tables in R and that you received ORA-12541 once more. If you only want to solve this problem, you can skip the next paragraph. However, for getting started with Oracle 12c and R you should take a look at the documentation provided by Oracle. Clearly and briefly, it says that you have to do the following things to use an Oracle database in R: (1) Install the Oracle Instant Client, if you don't already have it, from http://www.oracle.com/technetwork/database/features/instant-client/ (2) Install Oracle's R distribution (ORD) from https://oss.oracle.com/OR (3) Add the path of the Instant Client to the PATH variable (4) Set the environment variable OCI_LIB64 to the path of the Instant Client (5) Install ORE Client Package for R from http://www.oracle.com/technetwork/database/options/advanced-...
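The two environment settings named in the steps above (PATH and OCI_LIB64) can be sanity-checked before starting R. The following is a minimal sketch, not taken from Oracle's documentation; the function name check_instant_client_env and the wording of the messages are assumptions made for this example.

```python
import os


def check_instant_client_env(environ):
    """Return a list of problems; an empty list means both settings look consistent."""
    problems = []
    oci_lib = environ.get("OCI_LIB64", "")
    if not oci_lib:
        problems.append("OCI_LIB64 is not set")
        return problems

    def normalize(path):
        # Make paths comparable regardless of case or trailing separators.
        return os.path.normcase(os.path.normpath(path))

    path_entries = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if normalize(oci_lib) not in {normalize(p) for p in path_entries}:
        problems.append("PATH does not contain the OCI_LIB64 directory")
    return problems


# Check the environment of the current process.
for problem in check_instant_client_env(os.environ):
    print(problem)
```

The check only compares the two variables with each other. It does not verify that the directory actually contains an Instant Client installation.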

How to use TOracleConnection under Lazarus for Win64

Lazarus programmers have had no way to use TOracleConnection under 64-bit Windows for years. Even if you tried to use TOracleConnection with a correctly configured Oracle 11g client, you were not able to connect to the Oracle database. The error message was always: "ORA-12154: TNS:could not resolve the connect identifier specified". Today I found a simple workaround to fix this problem. It seems that the OCI.DLL from Oracle Client 11g2 is buggy. All my attempts to identify the error ended there. I could rule out problems with the TNS system in Oracle or with the Free Pascal file oracleconnection.pp, even though the error message suggests such problems. After investigating the function calls with Process Monitor (Procmon), I found out that even the file TNSNAMES.ORA was found and read correctly by the Lazarus test application. So trouble with missing files or wrong registry keys could also be eliminated. Finally I installed the Oracle Instant Client 12.1c - aft...

Accelerating nested tables by "Letter Vectors"

One major problem when using Text Mining in PL/SQL is that sooner or later associative arrays or even nested tables are required to store large lists of words and their attributes. Those collections are very slow to iterate. Now you may think that a hashed value of the word you need to find could be used as an index, which would be much faster here. Well, it would be. But Oracle 11g2 contains no hash function that returns unique results for lists of 10,000 words. In several tests I got duplicate values after ~86 words, independent of the size specified in the function parameters. I tried ORA_HASH as well as the HASH function of the DBMS_CRYPTO package. Both showed the same problem. Because of this problem and the long run times of the procedures mining the text, it was necessary to speed up the algorithm somehow. And the algorithm was just iterating across a collection (nested table), comparing each entry with the wanted word, which was very slow. Here I got an idea. I used alrea...
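The difference between iterating over a collection and looking a word up through an index can be sketched outside the database. The example below is plain Python, not PL/SQL, and it does not show the "Letter Vectors" approach itself, which the excerpt does not describe. The function names find_linear and build_index are chosen only for this illustration.

```python
def find_linear(words, wanted):
    """Compare every entry with the wanted word, as the slow procedure does."""
    for position, word in enumerate(words):
        if word == wanted:
            return position
    return -1


def build_index(words):
    """Map each word to its first position so later lookups need no iteration."""
    index = {}
    for position, word in enumerate(words):
        index.setdefault(word, position)  # keep the first occurrence only
    return index


words = ["text", "mining", "oracle", "mining"]
index = build_index(words)
print(find_linear(words, "oracle"), index["oracle"])
```

Both calls return the same position. The linear search touches every entry before the match, while the index lookup costs roughly the same no matter how long the list is.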

Disaggregating CLOBs in PL/SQL

For a seminar during my Master's studies, I am currently occupied with Text Mining. For that in particular, I use a lot of CLOBs in an Oracle 11g2 database. A CLOB is a Character Large Object and can store up to 8 terabytes of character data. The VARCHAR2 data type can only store up to 4000 characters. For a lot of applications 4000 characters is sufficient, but for storing texts like a publication or a newspaper article it is not enough. Here a CLOB is required. Handling a CLOB can be very difficult because of its size. You need to cut off single words or even sentences. This can be very easy using regular expressions. The process of disaggregating a text is called tokenization. The single words cut off from the text are the tokens. The code below shows a procedure I use for tokenizing a CLOB. The CLOB is loaded from the table into the variable b_text. Then a loop is run in which a single word l_word is cut off from the CLOB using a regular expression and the function REGEXP_...
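The idea of cutting tokens off a text with a regular expression can be sketched in Python. This is not the PL/SQL procedure from the post, which is not shown in this excerpt. The pattern and the function name tokenize are assumptions made for this illustration.

```python
import re

# Assumption for this sketch: a token is a run of letters, digits or underscores.
WORD_PATTERN = re.compile(r"\w+")


def tokenize(text):
    """Yield the tokens of a text one by one, without building a full list first."""
    for match in WORD_PATTERN.finditer(text):
        yield match.group(0)


for token in tokenize("Handling a CLOB can be very difficult."):
    print(token)
```

Yielding the tokens one at a time keeps memory use low for long texts. Note that this simple pattern splits a word such as "don't" into two tokens.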

Lazarus IDE, Oracle 11g and German Umlaute

Working with a database can be so easy - as long as you don't need to care about localization or languages with more or different letters than English. One of these languages is German. And here the problem starts... Introduction: Let's look at a simple scenario: inserting a text with German letters like ä, ö, ü and ß into the Oracle 11g database. Even at this point there can already be a problem when trying to select it again. So what is the problem here? Actually the problem itself is very simple: the character encoding. However, the solution is not as simple. I tried to connect to Oracle via an ODBC connection from the Lazarus IDE to insert German press releases with a lot of umlauts (äöü). The result in the database was a mess. So how can this mess be resolved? The "problematic chain": First of all we have to consider that different operating systems have different character sets. Usually current Linux distributions have a Unicode character set in contrast to Windows, u...
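The kind of mess an encoding mismatch produces can be reproduced in a few lines of Python. This is only an illustration with no Oracle or ODBC connection involved, and the pairing of UTF-8 with code page 1252 is an assumption chosen for the example, not a statement about the setup in the post.

```python
text = "äöüß"

# One side of the chain writes the text as UTF-8 bytes ...
utf8_bytes = text.encode("utf-8")

# ... and another side wrongly interprets those bytes as Windows code page 1252.
misread = utf8_bytes.decode("cp1252")

# Reversing the wrong step restores the text, as long as no bytes were lost.
repaired = misread.encode("cp1252").decode("utf-8")

print(len(text), len(misread))  # each special letter turned into two characters
print(repaired == text)
```

Every link in the chain (client, driver, database) has to agree on the encoding; a single wrong assumption is enough to garble each umlaut.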