Text Mining 1 - The problem of propositional logic and natural languages

Natural Language Processing (NLP) is becoming more and more important. It is used to determine the meaning of text written by human beings, and that is exactly why its importance keeps growing: every day, millions of people write. They write reviews of products they ordered, they comment on Facebook, they write their blogs. Of course, the industry wants their opinions about products, songs, ideas and everything else; just look at Facebook. Opinions are money. There is (at least) one problem: how do you capture this money when it comes in the form of opinions? That is where Natural Language Processing comes in.

The first thing needed to process a (natural) language is a parser. Most parsers process synthetic languages such as programming languages, which have a defined syntax and logical semantics. If the author of source code violates the defined syntax or semantics, the parser aborts processing and reports an error. So the author has to follow the conventions and rules of the synthetic language to make the code compile. This is already the biggest difference between synthetic and natural languages. A natural language has "softer" conventions, since nothing fails to compile or run if the author makes a mistake. Other human beings may notice the error, yet they can still immediately recognize what the author of the text really wanted to say. But can a computer?

The Stanford Parser is a parser for natural languages, a tool to understand text that was written by humans or produced by a speech-to-text converter. It was developed at Stanford University in California, USA. The parser outputs a phrase structure tree. Such trees were introduced by Noam Chomsky in 1957 and represent words and grammar to describe a language's syntax. Chomsky is an American linguist and scientist. While working on my Master's thesis I used the Stanford Parser and accidentally discovered that it fails when parsing German with a specific, language-dependent parsing model. The sentence I used as input was "Es wurde nicht gut oder falsch ausgeführt" ("It was not executed well or wrongly"). The resulting phrase structure tree is shown in the picture below. The problem is that a human would understand that something was done wrong or, alternatively, that it was not done well. The parser, however, assumes propositional logic for the natural language and interprets the sentence as saying that something was done not well and not wrong at the same time. The reason lies in propositional logic: one of its rules (De Morgan's law) says that NOT (A OR B) = (NOT A) AND (NOT B). The resulting output tree shows exactly this, but it is wrong for human interpretation because the semantics of the sentence exclude it. The parser does not recognize the semantics, that is, the meaning of the sentence.
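
The two readings can be compared in a small truth table. The following is only a sketch in Python, not anything taken from the parser: the function names and the idea of modelling "good" and "wrong" as two independent truth values are my own assumptions for illustration.

```python
from itertools import product

def parser_reading(good: bool, wrong: bool) -> bool:
    # The negation covers the whole disjunction: NOT (good OR wrong)
    return not (good or wrong)

def de_morgan_form(good: bool, wrong: bool) -> bool:
    # Equivalent form according to the rule: (NOT good) AND (NOT wrong)
    return (not good) and (not wrong)

def human_reading(good: bool, wrong: bool) -> bool:
    # The negation covers only "good": (NOT good) OR wrong
    return (not good) or wrong

rows = list(product([False, True], repeat=2))

# The rule NOT (A OR B) = (NOT A) AND (NOT B) holds for every assignment.
de_morgan_holds = all(
    parser_reading(g, w) == de_morgan_form(g, w) for g, w in rows
)

# Assignments (good, wrong) for which the two readings disagree.
differing = [
    (g, w) for g, w in rows if parser_reading(g, w) != human_reading(g, w)
]

print("De Morgan holds:", de_morgan_holds)
print("Readings differ for (good, wrong) =", differing)
```

The rule itself is correct. What goes wrong is which part of the sentence the "nicht" is attached to.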

But how to do it correctly? The explanation is very simple. The parser just needs to know that the OR in this case is an exclusive OR. In natural language, we almost always have an exclusive or. You can go this way or the other way. You do something or you do not. The or of propositional logic is an artificial creation: it means A or B or both (all three cases result in TRUE). But in reality, it is almost always either A or B. It seems that the programmers of the parser did not consider this and assumed propositional logic for natural languages, which is simply wrong. With this in mind, I ran a second test using the sentence "Es wurde entweder nicht gut oder falsch ausgeführt" ("It was executed either not well or wrongly"), and it was not that surprising that the parser did not show the correct output.
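
The difference between the two kinds of "or" is easy to see side by side. This is a minimal Python sketch; the function names are made up for this post and have nothing to do with the parser's internals.

```python
def inclusive_or(a: bool, b: bool) -> bool:
    # The "or" of propositional logic: A or B or both
    return a or b

def exclusive_or(a: bool, b: bool) -> bool:
    # The "either ... or" of everyday language: exactly one of A and B
    return a != b

# One row per assignment: (A, B, A OR B, A XOR B)
table = [
    (a, b, inclusive_or(a, b), exclusive_or(a, b))
    for a in (False, True)
    for b in (False, True)
]

for a, b, incl, excl in table:
    print(f"A={a!s:5} B={b!s:5} OR={incl!s:5} XOR={excl!s:5}")

# The only assignments where the two operators disagree.
disagreements = [(a, b) for a, b, incl, excl in table if incl != excl]
```

The two operators only disagree when both A and B are true, which is exactly the "or both" case from above.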

Conclusion

When parsing natural languages, one should not make the mistake of assuming that an artificial creation like propositional logic is built into the language. And when you work with a parser's output, you should look at that output really closely.

Experiment

To reproduce my experiment, use the following software and configuration:

Stanford Parser release: 16.06.2014
Software version: 3.4
Parser model: germanFactored.ser.gz
Input: Es wurde nicht gut oder falsch gemacht
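
For anyone who wants to script the run, here is a rough Python sketch that only builds the command line for the parser and does not start it. The jar file names, the model path inside the models jar, the memory setting and the input file name are assumptions on my side and have to be adapted to the local installation; the main class and the options follow the command line interface of the Stanford Parser distribution.

```python
import os
import shlex

# Assumed jar names for release 3.4; adjust to the local download.
DEFAULT_JARS = ["stanford-parser.jar", "stanford-parser-3.4-models.jar"]

# Assumed location of the German model inside the models jar.
DEFAULT_MODEL = "edu/stanford/nlp/models/lexparser/germanFactored.ser.gz"

def build_parser_command(input_file, model=DEFAULT_MODEL, jars=DEFAULT_JARS):
    """Return the argument list for one parser run (nothing is executed)."""
    classpath = os.pathsep.join(jars)
    return [
        "java", "-mx1g",                # heap size is a guess
        "-cp", classpath,
        "edu.stanford.nlp.parser.lexparser.LexicalizedParser",
        "-encoding", "utf-8",           # the input contains umlauts
        "-outputFormat", "penn",        # phrase structure tree as text
        model,
        input_file,
    ]

# "input.txt" is a hypothetical file containing the input sentence.
command = build_parser_command("input.txt")
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to subprocess.run once the paths are correct.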
