This document will help you understand the statistical information available in the report. Statistics are here to help you translate raw data into decision elements.
There are two different kinds of elements displayed in the report:
- Chances to win
- Confidence intervals
These elements are computed from the collected data and with different formulas or algorithms depending on the data type and the context, but the meaning is always the same.
Chances to win (meaning)
This is displayed as a percentage and indicates the chances that a given variation will win over the original. A 75% chance of winning means that there is a 75% chance that the corresponding variation will be better than the original. Conversely, a 25% chance of winning means that there is a 75% chance that the corresponding variation will be worse than the original.
A common threshold used in CRO is 95%, but it is important to note that this index does not give any indication of the size of the gain. In practice, estimating size is important as any variation implementation has a cost. This is why we provide (if possible) a gain estimate through the confidence interval.
Confidence interval (meaning)
Confidence intervals are given as a pair of numbers between brackets. For example, [7%,10%] means that the corresponding gain is estimated to be between 7% and 10%. More precisely, there is a 95% chance that the true gain lies between 7% and 10%, because the confidence interval given in the report is configured as a "95% confidence interval". In the remaining 5% of cases, the true value may fall outside the interval. 5% may sound high, but this error is split evenly between the two sides of the confidence interval, meaning that only 2.5% is on the bad side of the interval (the one that would be below 7% in our example case). We consider that the other 2.5% chance of being outside the confidence interval (above 10% of gain) is a good thing if it ever happens.
This means that the boundaries of the confidence interval can be seen as best and worst-case scenarios.
For instance, a confidence interval of [0%,10%] tells us that the corresponding variation, considering the 2.5% worst cases, could be equivalent to or worse than the original, and that there is little chance that the gain exceeds 10%. Collecting more data will shrink the confidence interval, giving more accurate estimates. In this [0%,10%] case, it would be wise to wait and collect more data to exclude the 0% and have a margin ensuring (in the worst case) that the implementation cost will be covered.
How these statistics are computed
There are two kinds of statistical tools depending on the type of data analyzed:
- For conversion data, which corresponds to the notion of success and failure, we use a Bayesian framework. Typical examples are making a purchase, reaching a given page, or agreeing to subscribe to a newsletter. This framework gives us a chances to win index and a confidence interval for the estimated gain.
- For transaction data, like the cart value, we use the Mann-Whitney U test, which is robust to "extreme" values. This test does not provide a confidence interval, so it only tells whether the average cart value goes up or down; it gives no information about the estimated gain.
For clicks data, we use a Bayesian framework where clicks are represented as binomial distributions, whose parameters are the number of tries and a success rate. In the digital experimentation field, the number of tries is the number of visitors and the success rate is the click or transaction rate. In this case, it is important to note that the rates we are dealing with are only estimates based on a limited number of visitors. To model this limited accuracy, we use beta distributions (which are the conjugate prior of the binomial distribution).
These distributions model the likelihood of a success rate measured on a limited number of trials.
Let’s take an example:
- 1,000 visitors on A with 100 successes
- 1,000 visitors on B with 130 successes
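We build the model:
Ma = beta(1 + success_a, 1 + failures_a)
where success_a = 100 and failures_a = visitors_a - success_a = 900.
(Note: the 1+ terms correspond to starting from a uniform beta(1,1) distribution, which assumes nothing about the rate before any data is collected. With other parameters, the beta distribution takes another shape and can model a different type of process.)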
For the three following graphs, the horizontal axis is the click rate while the vertical axis is the likelihood of that rate knowing that we had an experiment with 100 successes in 1,000 trials.
We observe that 10% is the most likely, 5% or 15% are doubtful, and 11% is half as likely as 10%.
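The shape of the curve can be checked numerically. The following is a minimal sketch, not the report's actual implementation: it evaluates the density of the beta model Ma at a few rates, relative to its value at 10%. The helper name beta_log_pdf is introduced here for illustration only.

```python
import math

def beta_log_pdf(x, a, b):
    """Log-density of a beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x)

visitors_a, success_a = 1000, 100
failures_a = visitors_a - success_a

# Ma = beta(1 + success_a, 1 + failures_a)
alpha_a, beta_a = 1 + success_a, 1 + failures_a

# Likelihood of each rate relative to the likelihood of a 10% rate
peak = beta_log_pdf(0.10, alpha_a, beta_a)
relative_likelihood = {
    rate: math.exp(beta_log_pdf(rate, alpha_a, beta_a) - peak)
    for rate in (0.05, 0.10, 0.11, 0.15)
}
for rate, rel in relative_likelihood.items():
    print(f"rate {rate:.0%}: {rel:.6f} of the likelihood at 10%")
```

Working in log space avoids overflow in the gamma function for counts of this size. The value printed for 11% should come out at a bit more than half of the value at 10%, while 5% and 15% are close to zero.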
The model Mb is built the same way with data from experiment B:
For B, the most likely rate is 13%, and the width of the curve’s shape is close to the previous curve.
Then we compare A and B rate distributions.
We see an overlapping area, around a 12% conversion rate, where both models have a similar likelihood.
To estimate the overlapping region, we need to sample from both models to compare them.
We draw samples from distributions A and B:
- s_a[i] is the i-th sample from A
- s_b[i] is the i-th sample from B
Then we apply a comparison function to these samples:
- The relative gain: g[i] = 100 * (s_b[i] - s_a[i]) / s_a[i] for all i.
It is the difference between the possible rates for A and B, relative to A (multiplied by 100 for readability in %).
We can now analyze the samples g[i] with a histogram:
We see that the most likely value of the gain is around 30%.
The yellow line shows where the gain is 0, meaning no difference between A and B. Samples that are below this line correspond to cases where A > B, and samples on the other side are cases where A < B.
We then define the gain chances to win as:
CW = (number of samples > 0) / total number of samples
With 1,000,000 (10^6) samples for g, we have 982,296 samples that are > 0, making B > A about 98% probable.
We call this the “chances to win” or the “gain probability” (the probability that you will win something).
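The sampling procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the report's code: it uses the standard library's random.betavariate, a fixed seed, and 100,000 samples instead of 10^6 to keep the run short. The variable names mirror the notation used above.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

n_samples = 100_000  # the text uses 10^6; fewer samples keep this quick

# Beta models for A (100 successes / 1,000) and B (130 successes / 1,000)
params_a = (1 + 100, 1 + 900)
params_b = (1 + 130, 1 + 870)

# Draw possible success rates from each model
s_a = [random.betavariate(*params_a) for _ in range(n_samples)]
s_b = [random.betavariate(*params_b) for _ in range(n_samples)]

# Relative gain of B over A, in percent, for each pair of samples
g = [100 * (b - a) / a for a, b in zip(s_a, s_b)]

# Chances to win: share of samples where B beats A
cw = sum(1 for x in g if x > 0) / n_samples
print(f"chances to win: {cw:.1%}")
```

The exact figure changes slightly with the seed and the number of samples, but it should stay close to the 98% quoted above.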
The gain probability is shown here (see the red rectangle) in the report:
Using the same sampling method, we can compute classic analysis metrics like the mean, median, percentiles, etc.
Looking back at the previous chart, the vertical red lines indicate where most of the blue area is, intuitively which gain values are the most likely.
We have chosen to expose a best and worst-case scenario with a 95% confidence interval. It excludes 2.5% of extreme best and worst cases, leaving out a total of 5% of what we consider rare events. This interval is delimited by the red lines on the graph. We consider that the real gain (as if we had an infinite number of visitors to measure it) lies somewhere in this interval 95% of the time.
In our example, the report shows [1.80%; 29.79%; 66.15%], that is, the lower bound, the median, and the upper bound of the interval. This means that it is quite unlikely that the real gain is below 1.80%, and it is also quite unlikely that the gain is more than 66.15%. There is an equal chance that the real gain is above or below the median, 29.79%.
It is important to note that, in this case, a 1.80% relative gain is quite small and may not be worth implementing, at least not yet, even if the best-case scenario is very appealing (66%). This is why, in practice, we suggest waiting for at least 5,000 visitors per variation before calling a test "ready", to obtain a smaller confidence interval.
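The bounds and the median can be read off the sorted gain samples. The sketch below is one simple way to do it, assuming the same beta models as in the example and a basic nearest-rank percentile. The helper percentile is hypothetical, and the report may use a different estimator.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable
n_samples = 100_000

# Same models as in the example: A = 100 / 1,000 and B = 130 / 1,000
s_a = [random.betavariate(1 + 100, 1 + 900) for _ in range(n_samples)]
s_b = [random.betavariate(1 + 130, 1 + 870) for _ in range(n_samples)]

# Sorted relative gains, in percent
g = sorted(100 * (b - a) / a for a, b in zip(s_a, s_b))

def percentile(sorted_values, q):
    """Nearest-rank style percentile; q is a fraction between 0 and 1."""
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[index]

lower = percentile(g, 0.025)   # worst-case scenario
median = percentile(g, 0.5)
upper = percentile(g, 0.975)   # best-case scenario
print(f"[{lower:.2f}%; {median:.2f}%; {upper:.2f}%]")
```

The printed values depend on the seed and the sample count, so they will not match the figures above to the second decimal, but they should land close to them.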
The confidence interval is shown here (in the red rectangle) in the report (on another experiment):
For data like transaction values, we use the Mann-Whitney U test because it is robust to extreme values.
A few customers placing very large orders can raise the average order value of a variation, even though they represent very few people. Imagine that an A/B test contains 10 extreme values (let's say 10 customers who spend 20 times the average order value). The chance that these 10 visitors are not evenly split between A & B is quite high, since the assignment is purely random. This will create a noticeable difference between the average order values of A & B. But this difference may not be statistically significant, because too few visitors are involved.
So it is important to trust the chances to win provided by this statistical test. It is not uncommon to see the observed average order value go up while the statistic reports a chance to win below 50%, showing the opposite trend. The reverse may also happen: a variation with an observed negative trend in average cart value can still be a winner if its chances to win are above 95%.
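To see how the average and a rank-based statistic can disagree, here is a small sketch on made-up order values (hypothetical data, not taken from any report). It computes the Mann-Whitney U statistic by direct pairwise comparison and normalizes it into the share of (B, A) pairs where the B order is larger. This is an illustration of the idea only: it is not the report's computation of the chances to win, and it does not compute a p-value.

```python
from statistics import mean

def mann_whitney_u(x, y):
    """U statistic for x versus y: number of (x, y) pairs with x > y, ties count 0.5."""
    return sum(
        1.0 if xi > yj else 0.5 if xi == yj else 0.0
        for xi in x
        for yj in y
    )

# Hypothetical order values: A contains one extreme order (500)
orders_a = [10, 12, 11, 13, 12, 500]
orders_b = [14, 15, 13, 16, 15, 14]

mean_a, mean_b = mean(orders_a), mean(orders_b)

u_b = mann_whitney_u(orders_b, orders_a)
share_b_higher = u_b / (len(orders_a) * len(orders_b))

print(f"average order value: A = {mean_a:.1f}, B = {mean_b:.1f}")
print(f"share of pairs where B > A: {share_b_higher:.1%}")
```

The single extreme order makes the average of A much higher than that of B, yet in most pairwise comparisons the B order is the larger one. A rank-based test looks at the second kind of evidence, which is why it is not dominated by a handful of extreme values.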
Limitations of any test procedure
This approach is only valid for one comparison. When comparing several variations, we need to take into account the implicit increase in risk. The basic idea is simple: when declaring a winner at 95% chances to win, we accept a 5% risk of making a mistake. This means that making 2 comparisons, as in an ABC test, leads to a risk of roughly 10% of making at least one mistake.
To take this into account, we apply the Holm-Bonferroni method to the chances to win index when there are more than 2 variations (including the original).
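The sketch below illustrates both points. It first computes the risk of at least one mistake over several comparisons, assuming the comparisons are independent. It then applies the standard Holm-Bonferroni step-down adjustment. How the report maps the chances to win index onto this method is not described here, so treating 1 - chances to win as a p-value-like quantity is an assumption made for illustration, and the input values are hypothetical.

```python
def familywise_risk(alpha, n_comparisons):
    """Risk of at least one mistake, assuming independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons

def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values must not decrease along the sorted order
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

print(f"risk with 2 comparisons at 5%: {familywise_risk(0.05, 2):.4f}")

# Hypothetical chances to win for variations B and C against the original
chances_to_win = [0.99, 0.96]
p_like = [1 - cw for cw in chances_to_win]   # assumption: treated like p-values
adjusted = holm_adjust(p_like)
adjusted_cw = [1 - p for p in adjusted]
print([f"{cw:.2%}" for cw in adjusted_cw])
```

Under the independence assumption the exact risk for two comparisons is 9.75%, which the 10% figure rounds up. In this example the strongest result is penalized the most (99% becomes 98%), while the weaker one keeps its value.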