## Learn how to correctly calculate and interpret the effect size for your A/B tests!

As a data scientist, you will most likely come across the effect size while working on some kind of A/B testing. A possible scenario is that the company wants to make a change to the product (be it a website, mobile app, etc.) and your task is to make sure that the change will — to some degree of certainty — result in better performance in terms of the specified KPI.

This is when hypothesis testing comes into play. However, a statistical test can only tell us how likely it is that an effect exists. By effect, I simply mean a difference: it can be a difference in either direction (a two-sided hypothesis), or a more precise hypothesis stating that one sample is actually better/worse than the other in terms of the given metric (a one-sided hypothesis). To know how big the effect is, we need to calculate the effect size.

In this article, I will provide a brief theoretical introduction to the effect size and then show some practical examples of how to calculate it in Python.

Formally, the **effect size** is the quantified magnitude of the phenomenon we are investigating. As mentioned before, statistical tests tell us the probability of observing an effect; however, they do not specify how big that effect actually is. This can lead to situations in which we detect a statistically significant effect that is nevertheless so small that, for the practical (business) case, it is negligible and not interesting at all.
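To see this in action, here is a quick sketch using simulated data: with very large samples, a t-test can flag a tiny difference in means as highly significant, while a standardized difference measure (Cohen's d, one popular effect size metric) reveals that the effect is negligible in practice.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two large samples whose true means differ only very slightly
a = rng.normal(loc=10.00, scale=2.0, size=100_000)
b = rng.normal(loc=10.05, scale=2.0, size=100_000)

# The t-test flags the difference as statistically significant...
t_stat, p_value = stats.ttest_ind(a, b)

# ...but Cohen's d (mean difference divided by the pooled standard
# deviation) shows the effect is tiny and likely irrelevant in practice
pooled_std = np.sqrt((a.std(ddof=1) ** 2 + b.std(ddof=1) ** 2) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_std

print(f"p-value: {p_value:.4g}, Cohen's d: {cohens_d:.3f}")
```

With 100,000 observations per group, the p-value comes out far below 0.05 even though Cohen's d is only around 0.03, well under the conventional 0.2 threshold for even a "small" effect.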

Additionally, when planning A/B tests, we want to estimate the expected duration of the test. This is connected to the topic of **power analysis**, which I covered in another article. To quickly summarize it: in order to calculate the required sample size, we need to specify three things — the significance level, the power of the test, and the effect size. Keeping the other two constant, the smaller the effect size, the harder it is to detect with a given level of certainty, and thus the larger the required sample size.

In general, there are potentially hundreds of different measures of effect size, each with its own advantages and drawbacks. In this article, I will present only a selection of the most popular ones. Before diving down the rabbit hole, note that measures of effect size can be grouped into 3 categories, based on their approach to defining the effect:

- Metrics based on the correlation
- Metrics based on differences (for example, between means)
- Metrics for categorical variables

The first two families cover continuous random variables, while the last one is used for categorical/binary features. To give a real-life example, we could apply the first two to a metric such as time spent in an app (in minutes), while the third family could be used for conversion or retention — expressed as a boolean.

Below, I will describe a few examples from each family of effect size measures in more detail and show how to calculate them in Python using popular libraries. Of course, we could just as well code these functions ourselves, but I believe there is no need to reinvent the wheel.

As the first step, we need to import the required libraries:
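The exact set of libraries is an assumption on my part, but for the measures described above, `numpy`, `pandas`, `scipy`, and `statsmodels` are the usual choices:

```python
# Assumed imports for the examples that follow: numerical arrays,
# tabular data, statistical tests, and power analysis respectively
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.power import TTestIndPower
```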