What's wrong with Scala? About Yates's Correction

Some months ago, I started a new job as a machine learning engineer for a German trading company. In the course of this, I began to work more with the programming language Scala. Scala offers some benefits that make work easier for data engineers building data pipelines, and it has good support for Spark, which is an advantage in cluster computing. But Scala (of course) also has its idiosyncrasies.

This article shows the idiosyncrasy of Yates's correction in the Statistics package of Scala and its consequences.

The Statistics Package

Like many other programming languages, Scala comes with a statistical package called "Statistics". It includes many basic methods of descriptive statistics, which you need for many purposes. One of these purposes is, for example, A/B testing.

So, what's wrong with Scala?

For an A/B test, I first tested the significance with my own calculation (the old-fashioned way, as I had learned during my studies) before using the statistical package of Scala. Call me a control freak, but I want to understand when I might be wrong and why. Anyway, the result of my own calculation differed from the one produced by the Scala Statistics package: I got a Chi-squared value of 5.882, while the Statistics package reported 5.510.

The test data was as follows: There were two groups, A and B. In each group an event x could occur, so for every visitor x either happened (x) or did not (!x). Group B had a slightly different environment than group A (the control group). In group A, there were 9,500 visitors without event x and 950 with event x. In group B (the test group), there were 500 visitors without event x and 32 with event x.

Event    Group A    Group B
!x       9,500      500
x        950        32


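This means that the ratio of visitors with x to visitors without x was 10% in group A (950 / 9,500) and 6.4% in group B (32 / 500). As a share of all visitors in each group, that is about 9.1% (950 / 10,450) versus about 6.0% (32 / 532). The question to answer was whether group B really had a smaller probability for x, or whether this was just the effect of a sample that was too small. The Chi-squared independence test should answer this question.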
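The hand calculation can be sketched in a few lines of plain Python. This is only a minimal illustration of Pearson's Chi-squared statistic without any correction; the helper name pearson_chi2 is made up for this example and is not part of any library.

```python
def pearson_chi2(table):
    """Pearson's chi-squared statistic for a contingency table (no correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / grand total
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: without x, with x; columns: group A, group B
table = [[9500, 500],
         [950, 32]]

print('stat=%.3f' % pearson_chi2(table))  # prints stat=5.882
```

The expected counts are derived from the row and column totals only, so the same function works for larger tables as well.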

The test statistic had to exceed the critical Chi² value of 3.84 (5% significance level, one degree of freedom), so the result was significant either way - but where did the difference come from? The simplest explanation was that I was wrong. But I needed proof. So I tested the values with the short Python code snippet below (the original code from Jason Brownlee can be found here), which requires the module SciPy (installable with pip: pip install scipy).

from scipy.stats import chi2_contingency
from scipy.stats import chi2

# Row 0: visitors without event x, row 1: visitors with event x
# Column 0: group A, column 1: group B
table = [[9500, 500],
         [950, 32]]

stat, p, dof, expected = chi2_contingency(table)
print('dof=%d' % dof)
print(expected)

# Critical value for the chosen significance level
alpha = 0.05
prob = 1 - alpha
critical = chi2.ppf(prob, dof)

print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))

The code above produced the output below. As you can see, Python also calculates the lower Chi² value of 5.510. But the reason was not that I had made a mistake.


dof=1
[[9515.57093426  484.42906574]
 [ 934.42906574   47.57093426]]
probability=0.950, critical=3.841, stat=5.510

After some research, I found the source of the deviation. It lies in the SciPy function chi2_contingency, which performs the Chi-squared test of independence on a contingency table. Things get a little clearer if we take a look at its signature:


scipy.stats.chi2_contingency(observed, correction=True, lambda_=None)

As we can see, the function accepts more parameters than the code passes. The source of the deviation is the parameter "correction". It defaults to True, which means that in one special case (only for 2x2 contingency tables, i.e. one degree of freedom) the so-called Yates's correction is applied when calculating the Chi² value. This correction is responsible for the lower Chi² value. (A detailed explanation of Yates's correction can be found here.)

The Chi-squared test has an upward bias that produces values that are too high, especially when the observed frequencies are low. Yates's correction reduces this effect by subtracting 0.5 from the absolute difference between each observed and expected count before squaring. It thus "pushes down" Chi-squared and provides a cautious or "conservative" interpretation of the results. At higher frequencies it should have less effect. But: the correction is debated among experts, because it is considered too strict (see Bradley et al, 1979).
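To see what the correction does to the numbers of this test, here is a minimal sketch in plain Python. It assumes the textbook form of Yates's correction, where each absolute deviation |observed - expected| is reduced by 0.5 (but not below zero) before squaring. The function name chi2_2x2 and its yates flag are invented for this example.

```python
def chi2_2x2(table, yates=False):
    """Chi-squared statistic for a 2x2 table, optionally with Yates's correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed - expected)
            if yates:
                # Shrink each deviation by 0.5, but never below zero
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat


table = [[9500, 500],
         [950, 32]]

print('uncorrected=%.3f' % chi2_2x2(table))            # prints uncorrected=5.882
print('corrected=%.3f' % chi2_2x2(table, yates=True))  # prints corrected=5.510
```

Both values from the comparison show up again: 5.882 without the correction and 5.510 with it.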

In Python, the correction parameter can be set to False. The result then matches my own calculation, as shown below:

from scipy.stats import chi2_contingency
from scipy.stats import chi2

# Row 0: visitors without event x, row 1: visitors with event x
# Column 0: group A, column 1: group B
table = [[9500, 500],
         [950, 32]]

# Correction parameter now set to False
stat, p, dof, expected = chi2_contingency(table, correction=False)

# Critical value for the chosen significance level
alpha = 0.05
prob = 1 - alpha
critical = chi2.ppf(prob, dof)

print('probability=%.3f, critical=%.3f, stat=%.3f' % (prob, critical, stat))

Finally, the result comes as expected:


probability=0.950, critical=3.841, stat=5.882
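To put the two statistics into perspective, they can be converted into p-values. The sketch below uses only the standard library and relies on the fact that, for one degree of freedom, the Chi-squared distribution is the distribution of a squared standard normal variable. The helper name p_value_1dof is made up for this example, and the formula is valid for one degree of freedom only.

```python
import math


def p_value_1dof(stat):
    """Upper-tail probability of a chi-squared distribution with 1 degree of freedom."""
    # For 1 degree of freedom: P(X > stat) = erfc(sqrt(stat / 2))
    return math.erfc(math.sqrt(stat / 2.0))


for label, stat in [('critical value', 3.841),
                    ('with correction', 5.510),
                    ('without correction', 5.882)]:
    print('%s: stat=%.3f, p=%.4f' % (label, stat, p_value_1dof(stat)))
```

The critical value of 3.841 maps to a p-value of roughly 0.05, and both test statistics end up below that threshold.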

Unfortunately, the Statistics package of Scala offers no option for disabling Yates's correction, so we currently have to live with it.

Consequences of Yates's correction

What is the consequence of using Yates's correction in A/B testing with high frequencies? Because the corrected Chi² value is lower than the uncorrected one, an A/B test may be interpreted as showing no statistically significant difference between groups A and B although there is one. This leads to the assumption that the tested change had no effect on the test group, which would be wrong.
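A borderline case makes this visible. The counts in the sketch below are hypothetical and invented purely for illustration; they are not from my A/B test. The critical value 3.841 is hardcoded for a 5% significance level and one degree of freedom, and chi2_2x2 is again just an illustrative helper, not a library function.

```python
CRITICAL = 3.841  # alpha = 0.05, 1 degree of freedom


def chi2_2x2(table, yates=False):
    """Chi-squared statistic for a 2x2 table, optionally with Yates's correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(observed - expected)
            if yates:
                # Yates's correction: reduce each deviation by 0.5, floor at zero
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat


# Hypothetical counts: rows = without x, with x; columns = group A, group B
table = [[900, 872],
         [100, 128]]

for yates in (False, True):
    stat = chi2_2x2(table, yates)
    print('yates=%s, stat=%.3f, significant=%s' % (yates, stat, stat > CRITICAL))
# prints:
# yates=False, stat=3.881, significant=True
# yates=True, stat=3.609, significant=False
```

With these made-up numbers the uncorrected test reports a significant difference, while the corrected one does not, even though each group has 1,000 visitors.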
