The average order value metric is easy to understand and easy to calculate, and it is of course an important number. However, it has an unseen, counterintuitive behavior when used in the context of an A/B test.
The average order value is sensitive
Let’s take an example: here are two lists of order values for variations A and B:
- A: 10,20,30,40,50,60 €, average order value = 35 €
- B: 40,50,60,70,80,90 €, average order value = 65 €
With this data, B is the winning variation just by looking at the average order value. And the Mann-Whitney statistical test confirms it, with a gain probability above 98% for variation B.
Up until now, everything is what you would expect...
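The comparison above can be reproduced in a few lines of Python. This is a minimal sketch, not AB Tasty’s actual computation: the helper `win_probability` is a hypothetical name for the normalized Mann-Whitney U statistic, i.e. the chance that a randomly picked order from B is larger than a randomly picked order from A (ties counting as half a win).

```python
def average(values):
    # Plain arithmetic mean of a list of order values.
    return sum(values) / len(values)

def win_probability(a, b):
    """Normalized Mann-Whitney U statistic for B: the fraction of
    (A, B) order pairs in which the B order is larger, counting
    ties as half a win."""
    wins = 0.0
    for x in a:
        for y in b:
            if y > x:
                wins += 1.0
            elif y == x:
                wins += 0.5
    return wins / (len(a) * len(b))

a = [10, 20, 30, 40, 50, 60]   # variation A, in euros
b = [40, 50, 60, 70, 80, 90]   # variation B, in euros

print(average(a), average(b))   # 35.0 65.0
print(win_probability(a, b))    # 0.875
```

A win probability of 0.875 is the rank-based effect the test evaluates; the “gain probability above 98%” quoted above is the confidence that this probability is really above 0.5, computed on top of this statistic.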
But changing just one value has a counterintuitive impact on the situation: the 60 € value in variation A is replaced by 300 €.
- A: 10,20,30,40,50,300 €, average order value = 75 €
- B: 40,50,60,70,80,90 €, average order value = 65 €
If we looked only at the average order value, we would now conclude that A is the winning variation. Our winner ‘loses’ compared to the previous experiment.
However, we only changed a single value! What’s more, this 300 € order can be considered an ‘extreme value’, and it is therefore unlikely to reoccur. It is irrational to base one’s optimization strategy on one rare event.
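Re-running the same computation on the modified data shows the asymmetry: the single outlier flips the comparison of averages but barely moves the rank-based statistic. A self-contained sketch (the helper `win_probability` is a hypothetical name for the normalized Mann-Whitney U statistic):

```python
def average(values):
    # Plain arithmetic mean of a list of order values.
    return sum(values) / len(values)

def win_probability(a, b):
    """Fraction of (A, B) order pairs in which the B order is larger,
    counting ties as half a win: the normalized Mann-Whitney U
    statistic for B."""
    wins = sum(1.0 if y > x else 0.5 if y == x else 0.0
               for x in a for y in b)
    return wins / (len(a) * len(b))

a = [10, 20, 30, 40, 50, 300]   # the 60 € order replaced by 300 €
b = [40, 50, 60, 70, 80, 90]

print(average(a), average(b))   # 75.0 65.0 -> A "wins" on the average
print(win_probability(a, b))    # about 0.78 -> the ranks still favor B
```

One value dragged A’s average from 35 € to 75 €, yet the rank statistic only slid from 0.875 to about 0.78, still clearly on B’s side.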
One should be aware that this problem can also generate false positives. The average order value can be artificially inflated by a few extreme values, which can lead us to believe we have a winning variation when in fact nothing has changed - or even when there has been a loss!
With this modified data set, the Mann-Whitney test still gives a gain probability above 94% for B. The statistical test doesn’t get fooled by the extreme value, and still identifies variation B as the winner.
This is the strength of this statistical test: it is rank-based, meaning it replaces each value by its rank in the combined, sorted data set. Thanks to this, it is relatively immune to extreme values (a good illustration is that the change made here from 60 € to 300 € only moved that one value up about three places in the ranking).
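The ranking mechanism can be made concrete: pool both samples, sort them, and replace each value by its rank, with tied values sharing an average ‘midrank’ as in the standard test. A sketch (the helper `midrank` is a hypothetical name introduced here for illustration):

```python
def midrank(pool, value):
    """1-based rank of `value` in the sorted pool; tied values
    share the average of the positions they occupy."""
    ordered = sorted(pool)
    first = ordered.index(value) + 1                  # first position held
    last = len(ordered) - ordered[::-1].index(value)  # last position held
    return (first + last) / 2

before = [10, 20, 30, 40, 50, 60] + [40, 50, 60, 70, 80, 90]
after = [10, 20, 30, 40, 50, 300] + [40, 50, 60, 70, 80, 90]

print(midrank(before, 60))   # 8.5: the two tied 60 € orders share ranks 8 and 9
print(midrank(after, 300))   # 12.0: the outlier only climbs to the top rank
```

Jumping from rank 8.5 to rank 12 moves a single value up by about three places, and replacing 60 € by 3,000 € instead of 300 € would change nothing further: this is exactly why the test stays stable.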
Take care: this strength is also a weakness! Because it uses ranks and not actual values, the Mann-Whitney test gives no indication of the size of the average order value gain between A and B. It is only useful for determining the direction in which the average order value is evolving.
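A self-contained sketch makes this limitation visible: two variations with very different average gains can produce exactly the same rank statistic (the helper `win_probability` is a hypothetical name for the normalized Mann-Whitney U statistic; the two B variants are invented for illustration).

```python
def win_probability(a, b):
    """Fraction of (A, B) order pairs in which the B order is larger,
    counting ties as half a win: the normalized Mann-Whitney U
    statistic for B."""
    wins = sum(1.0 if y > x else 0.5 if y == x else 0.0
               for x in a for y in b)
    return wins / (len(a) * len(b))

a = [10, 20, 30, 40, 50, 60]        # average 35 euros
b1 = [61, 62, 63, 64, 65, 66]       # average 63.5 euros: modest gain
b2 = [61, 600, 700, 800, 900, 1000] # average about 677 euros: huge gain

# Every single order in b1 and in b2 beats every order in a, so the
# rank statistic is identical: it cannot tell the two gains apart.
print(win_probability(a, b1), win_probability(a, b2))   # 1.0 1.0
```

Both variants get the maximum possible statistic, even though one improves the average order by roughly 29 € and the other by roughly 640 €: the test reports the direction of the effect, never its size.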
In our example, the average order value suggests that B performs worse, when in reality it performs better. The opposite can also occur, which is more intuitive but just as misleading: we observe an enormous increase in the average order value, confirmed by the Mann-Whitney test, but it translates into only a very small increase once actually put into production. We could even conclude there has been a decrease if we look only at the average order value and forget to check the Mann-Whitney test.
Recap: The average order value indicator seems intuitive, but it isn’t. In the presence of extreme values it shows enormous natural variation: it can move quite a bit even without any external interference (without tests). It is therefore crucial to rely on the indicator provided by the Mann-Whitney test, and to accept that the test tells us nothing about the size of the effect…
It’s also important to mention that this phenomenon affects all order-related metrics: total revenue, revenue per visitor, and so on.
If your strategy requires an estimate of the size of the gain - for example, because variation B has a specific cost (product recommendation tool, chatbot…) - get in touch with your AB Tasty consultant to configure the test so that it collects the complementary data needed to estimate that gain.