
Text Mining 1 - The problem of propositional logic and natural languages

Natural Language Processing (NLP) is becoming more and more important. It is used to determine the meaning of text written by human beings, and that is exactly why its importance keeps growing: every day, millions of people write. They write reviews of products they ordered, they write comments on Facebook, they write their blogs. Of course, industry wants their opinions about products, songs, ideas and everything else; just look at Facebook. Opinions are money. There is (at least) one problem: how do you capture this money in the form of opinions? That's where Natural Language Processing comes in.

The first thing needed to process a (natural) language is a parser. Most parsers process synthetic languages such as programming languages, which have a defined syntax and logical semantics. If the author of source code disregards the defined syntax or semantics, the parser aborts processing and throws an error. So the author has to follow the conventions and rules of the synthetic language to make the code compile. And this is already the biggest difference between synthetic and natural languages. A natural language has "softer" conventions, since there is no reality that refuses to compile or run if the author makes a mistake. And although other human beings may notice the error, they can immediately recognize what the author of the text really wanted to say. But can a computer do the same?

The Stanford Parser is a parser for natural languages, a tool to understand text that was written by humans or produced by a speech-to-text converter. It was developed at Stanford University in California, USA. The parser usually gives a phrase structure tree as output. Such graphical trees were introduced by Noam Chomsky in 1957 and represent words and grammar to describe a language's syntax. Chomsky is an American linguist and scientist. During my Master's thesis I worked with the Stanford Parser and accidentally discovered that it fails when parsing German with a specific, language-dependent parser model. The sentence I used as input was "Es wurde nicht gut oder falsch ausgeführt" ("it has not been executed well or wrong"). The resulting phrase structure tree is shown in the picture below. The problem is that a human would understand that something has been done wrong or, alternatively, that it has not been done well. The parser, however, assumes propositional logic for the natural language and concludes that something has been done not well and not wrong at the same time. The reason lies in propositional logic. One of its rules (De Morgan's law) says that NOT (A OR B) = (NOT A) AND (NOT B). The resulting output tree shows exactly this, but it is wrong for human interpretation because the semantics of the sentence exclude this reading. The parser does not recognize the semantics and the meaning of the sentence.
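The difference between the two readings can be made concrete with a small truth table. The following Python sketch is only an illustration of the logic described above; the function names are made up for this example and have nothing to do with the Stanford Parser's code or API.

```python
from itertools import product


def parser_reading(good: bool, wrong: bool) -> bool:
    """NOT (good OR wrong): the negation covers the whole disjunction."""
    return not (good or wrong)


def de_morgan_form(good: bool, wrong: bool) -> bool:
    """(NOT good) AND (NOT wrong): the same statement after De Morgan's law."""
    return (not good) and (not wrong)


def human_reading(good: bool, wrong: bool) -> bool:
    """(NOT good) OR wrong: the negation covers only 'good'."""
    return (not good) or wrong


# Enumerate all four combinations of the two truth values.
rows = list(product([False, True], repeat=2))

print("good  wrong  parser  human")
for good, wrong in rows:
    print(good, wrong, parser_reading(good, wrong), human_reading(good, wrong))
```

For the case "not done well and done wrong" (good = False, wrong = True), the parser's reading is false while the human reading is true, which is exactly the mismatch described above.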

But how can it be done correctly? The explanation is very simple. The parser just needs to know that the OR is an exclusive OR. In natural language, we almost always have an exclusive or. You can go this way or the other way. You do something or you don't. It is always an exclusive or. The or of propositional logic is an artificial creation: it means A or B or both (all three cases result in TRUE). But in reality, it is always either A or B. It seems that the programmers of the parser didn't consider this and assumed propositional logic for natural languages, which is simply wrong. With this in mind, I started a second test using the sentence "es wurde entweder nicht gut oder falsch ausgeführt" ("it has been executed either not well or wrong"), and, not very surprisingly, the parser now showed the correct output.
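As a minimal sketch of the distinction drawn here, the two kinds of "or" can be compared in Python. The helper names inclusive_or and exclusive_or are chosen for this illustration only.

```python
from itertools import product


def inclusive_or(a: bool, b: bool) -> bool:
    """The 'or' of propositional logic: true for A, for B, and for both."""
    return a or b


def exclusive_or(a: bool, b: bool) -> bool:
    """The 'either ... or' reading: true for exactly one of A and B."""
    return a != b


# Collect the input combinations on which the two operators disagree.
differences = [
    (a, b)
    for a, b in product([False, True], repeat=2)
    if inclusive_or(a, b) != exclusive_or(a, b)
]

print(differences)
```

The two operators disagree only when both A and B are true, which is the one case the "either ... or" reading rules out.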

Conclusion

When parsing natural languages, one shouldn't make the mistake of assuming that an artificial creation like propositional logic is built into the language. And when you use a parser and work with its output, you should take a really close look at that output.

Experiment

To reproduce my experiment, use the following software and configuration:

Stanford Parser release: 16.06.2014
Software version: 3.4
Parser model: germanFactored.ser.gz
Input: Es wurde nicht gut oder falsch gemacht
