
Data Analytics: How to protect against random correlations

Many analysts know the feeling: you find a significant correlation, and then you wonder how reliable that finding actually is. Of course, a good model has to be evaluated against test data; there is no way around that. But sometimes doubts remain, because even test data can be unreliable.

At the beginning of this year I found such a significant correlation in my work (a Spearman coefficient of 0.8). However, it could only be evaluated to a limited extent: due to external circumstances, many attributes of the data can change in the department concerned, and not all of those changes are documented. An incomplete dataset is something most analysts will recognize, too.

So I asked myself: how can I back up this finding nevertheless? The nature of the correlation turned out to be of central importance.

The situation was similar to a well-insulated building in which the room temperature depends on both the outside temperature and a heating system. Because of the thick insulation, a change in the outside temperature reaches the interior with a delay of about two days. After these two days, the temperature in the building has dropped so far that the heating switches on and warms up the interior. It quickly becomes apparent that this control loop applies only in winter. In summer, the interior simply warms up with the same two-day delay. In winter there is the additional problem that the heating cancels out the correlation between outside and inside temperature.
Now imagine that the heating is undersized and cannot hold the interior at its set temperature, so that the correlation between outside and inside temperature is preserved.

Correlation between both temperatures with variable time offset (x axis)

And now we find exactly this correlation, but we do not yet know that the heating is too weak. Normally one would doubt that the correlation is causal, since the building is heated. And that was precisely the sticking point. So how do you protect a finding against "random" correlation? There is no general solution to such problems, but for this particular one a solution came to my mind.

What I did not know at the beginning of the investigation was the time lag between the outside temperature and the delayed temperature inside. So I had to join my data at different time offsets: I took the outside-temperature data, shifted its date by +X days, joined it to the interior-temperature data on the date, and computed the correlation for each X. This was laborious and gave me a series of values that resembled a curve.
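A minimal sketch of such a lag scan in plain Python. The data is synthetic (a seasonal outside temperature plus day-to-day weather, and an inside temperature that follows it with an assumed two-day delay), and the names `lagged_spearman`, `outside` and `inside` are made up for illustration; this is not the code used in the original investigation.

```python
import math
import random

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def lagged_spearman(outside, inside, lag):
    """Correlate outside[t] with inside[t + lag]; lag may be negative."""
    if lag >= 0:
        a, b = outside[:len(outside) - lag], inside[lag:]
    else:
        a, b = outside[-lag:], inside[:len(inside) + lag]
    return spearman(a, b)

# Synthetic example data (assumption): two years of daily values.
rng = random.Random(42)
days = 730
outside = [10 - 10 * math.cos(2 * math.pi * t / 365) + rng.gauss(0, 3)
           for t in range(days)]
# The inside temperature follows the outside temperature two days later.
inside = [18 + 0.4 * outside[max(t - 2, 0)] for t in range(days)]

# Scan offsets of 0..10 days and pick the one with the highest coefficient.
curve = {lag: lagged_spearman(outside, inside, lag) for lag in range(0, 11)}
best_lag = max(curve, key=curve.get)
print(best_lag, round(curve[best_lag], 3))
```

Slicing the two lists against each other plays the role of the date join with +X days; with real data the same idea would be a join on a shifted date column.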

The temperature changes continuously over the year; it is the classic seasonal component. The outside temperature reaches its maximum and its minimum six months apart: after the maximum it gradually drops from June onwards, and after the minimum it gradually rises from December onwards. At a shift of 6 months (i.e. about 180 days), the correlation of the outside temperature with itself would therefore have to be close to -1, because the same variable then behaves in exactly the opposite way. At three months (about 90 days), the correlation would have to be close to zero. The same applies to the interior temperature, with 2 days added. Conversely, this also means that the curve would have to be symmetric when the time is shifted in the other direction. This now had to be tested. I performed the same laborious data transformation in the other direction and obtained the following curve of correlation values.

Correlation between both temperatures with variable time offset (x axis) from -90 to +90 days
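The expected shape of this curve (close to zero at about 90 days, close to -1 at about 180 days, and symmetric in both directions) can be checked on an idealized year. The sketch below assumes a perfectly periodic 365-day sine-shaped temperature, so that shifts simply wrap around the year, and it uses the Pearson coefficient instead of Spearman to keep the code short. The two-day delay and all names are assumptions for illustration.

```python
import math

DAYS = 365  # idealized, perfectly periodic year (assumption)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Purely seasonal outside temperature, coldest at t = 0.
outside = [10 - 10 * math.cos(2 * math.pi * t / DAYS) for t in range(DAYS)]
# Inside temperature: the same signal, two days later (assumed delay).
inside = [outside[(t - 2) % DAYS] for t in range(DAYS)]

def cross_corr(lag):
    """Correlation of outside[t] with inside[t + lag], wrapping around the year."""
    shifted = [inside[(t + lag) % DAYS] for t in range(DAYS)]
    return pearson(outside, shifted)

for lag in (-90, -40, 2, 44, 93, 184):
    print(lag, round(cross_corr(lag), 3))
```

In this idealized setting the coefficient at offset X equals the cosine of the angle that X - 2 days covers on the yearly cycle, which is why the curve peaks at the delay, crosses zero roughly a quarter year away from it, and is mirror-symmetric around it.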

The curve of Spearman's correlation coefficients was indeed symmetric, which suggested that the correlation was not random but causal. With a random correlation it would be extremely unlikely for the results to follow a curve, let alone a symmetric one. The following graphic illustrates what a random correlation would look like.

Random correlation coefficients across 180 days would show spikes like these
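The contrast between the two kinds of curve can be reproduced with a small experiment. The sketch below assumes that "random" means two unrelated white-noise series, compares them with an idealized seasonal signal and its two-day-delayed copy, and counts how often each lag curve switches between rising and falling. It uses Pearson instead of Spearman for brevity, and the helper names are hypothetical.

```python
import math
import random

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def lag_curve(x, y, lags):
    """Correlation of x[t] with y[t + lag], wrapping around the series end."""
    n = len(x)
    return [pearson(x, [y[(t + lag) % n] for t in range(n)]) for lag in lags]

def direction_changes(curve):
    """Count how often the curve switches between rising and falling."""
    steps = [b - a for a, b in zip(curve, curve[1:])]
    return sum(1 for a, b in zip(steps, steps[1:]) if a * b < 0)

n = 365
lags = list(range(-90, 91))

# Related pair: seasonal signal and a copy delayed by two days (assumption).
seasonal = [10 - 10 * math.cos(2 * math.pi * t / n) for t in range(n)]
delayed = [seasonal[(t - 2) % n] for t in range(n)]

# Unrelated pair: two independent noise series with a fixed seed.
rng = random.Random(1)
noise_a = [rng.random() for _ in range(n)]
noise_b = [rng.random() for _ in range(n)]

smooth = lag_curve(seasonal, delayed, lags)
spiky = lag_curve(noise_a, noise_b, lags)

print("related pair:  ", direction_changes(smooth), "direction changes")
print("unrelated pair:", direction_changes(spiky), "direction changes")
```

The related pair rises to a single peak and falls again, while the unrelated pair produces small coefficients that jump up and down from one offset to the next. Note that this only separates a structured relationship from pure noise; two unrelated series that both happen to be seasonal would also give a smooth curve.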

