A/B testing has been popular for a while, and if you browse job descriptions and recruiter emails you will see it used as a lure. Who wouldn’t want to work for a company where nobody dictates how the system should behave and where decisions are made rationally, based on numbers?
The issue I have witnessed over the last few years is that sometimes, not always, conclusions are drawn simply because they are the best the data could offer. The statement declaring the winning scenario then misses the asterisk mentioning that the final decision did not rest on a rigorous reading of the collected metrics, but on assumptions pulled from a few numbers. The reasons for tilting the numbers to one side can vary, but most of the time a badly designed A/B test or a misreading of the results works in favor of someone who wants to push an idea through.
Let’s take one scenario among many. Imagine you have a system with many costly features. By costly, I mean that they are hard to maintain or that they consume expensive resources. Two people argue. One wants to cut the features while the other would rather fix all the issues. Before making a decision, the one who would like to cut the features suggests running an A/B test with a control group and a test group to determine whether users are more likely to leave the product if these features are removed. This looks rational, and both people agree. They remove one feature. After a while, the data comes back and shows that the user count stayed steady and very close to that of the control group. The conclusion is that removing the feature had no significant effect. Engineers remove the tested feature from the product. Then another feature is tested on a sample of users, with the same result. After iterating for a while, almost all the expensive features have been taken out. The last A/B test showed more people leaving than in the previous ones, not by a huge amount but enough to justify keeping that feature. Still, it looks like a good gain, since nearly all of the costly features got cut.
There are two problems with this scenario. The main one is the fallacy that enough data was collected to make a decision. The effect of removing or altering existing content goes beyond the data that can be collected by the product where the change is made. If you remove a feature and users are not satisfied, it doesn’t mean that they will leave the product. They might keep using it because of other features. However, they might also turn to another service to fill the gap. Dissatisfaction is hard to measure. For example, say you have a music service and you remove some editing feature. The user may keep using your service because you have a great music catalog, but will also use another service for editing music. The problem is that if the music editing service later builds a good catalog, the user will eventually leave, and by that time the departure won’t show up in the results of the old A/B test.
The second problem in the scenario described is that each feature was A/B tested individually. Do not get me wrong, it’s often safer to make small moves than a big one. However, in our example, when the last feature was taken out and the data showed users behaving differently, something was missing from the data. The data doesn’t show that users got irritated by all the features removed over the last few months. The users are now at their limit and leave, not specifically because of the last feature, but because of the sum of all the removals. The conclusion drawn will be that the last feature should come back, but the reality is that bringing it back won’t change anything.
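To make the arithmetic concrete, here is a minimal sketch of how small losses can add up. The numbers are hypothetical and not taken from any real test: I assume each removal costs 1% of the remaining users, and that each test only compares against a control group that has already lost the previously removed features.

```python
def cumulative_retention(per_step_retention: float, steps: int) -> float:
    """Share of the original users still present after `steps` removals.

    Assumption for illustration only: every removal keeps the same
    fraction `per_step_retention` of the users who were still there.
    """
    return per_step_retention ** steps


# Hypothetical figure: each individual test sees only a 1% loss.
PER_STEP_RETENTION = 0.99

for removals in (1, 5, 10):
    kept = cumulative_retention(PER_STEP_RETENTION, removals)
    print(f"{removals:2d} removals -> {kept:.1%} of the original users remain")
```

Under these assumptions, every individual test reports a loss that is easy to dismiss as noise, while ten removals together cost close to 10% of the original users. No single test in the sequence measures that total.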
Many A/B tests are conducted the right way, and in those cases it makes total sense to base a future decision on the collected data. For example, you are testing two sign-up buttons on your page. One gets more people to join your product than the other, in a very similar environment.
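For a case like the sign-up buttons, one common way to check whether the difference is more than noise is a two-proportion z-test. The sketch below is only an illustration using the standard library; the function name and the visitor and sign-up counts are made up for the example, and a real analysis would also need to fix the sample size before looking at the data.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: sign-ups and visitors for button A (hypothetical data)
    conv_b / n_b: sign-ups and visitors for button B (hypothetical data)
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both buttons perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 200 of 5000 visitors sign up with A, 260 of 5000 with B.
z, p = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value here only says that the two buttons convert differently on sign-ups. It says nothing about satisfaction or about what users do later, which is exactly the part that stays outside the collected data.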
Satisfaction is always hard to factor into the equation when performing A/B testing. This is why it is wise to be careful when invoking A/B testing. It might sound like the right thing to do for every decision, but there are areas where the test ends up being a fallacy that just looks great because it backs up a hypothesis we cherish more than reality.