Why A/B testing is bad for your startup
Data science you need to know.
This article summarizes this lengthy, but oh so amazing, piece about how badly run A/B tests can produce winning results that are more likely to be false than true. The bad news? Bad A/B testing can put your business at real risk.
Rule #1: have the right number of dogs
Let's say you are trying to answer the following question: are adult dogs or kittens heavier?
You might want to weigh more than 1 kitten and 1 adult dog, because if not, you have a real chance of picking an extraordinarily small dog. (Have you seen Yuki, our office dog?)
Statistical significance is just that. You need a large enough number of events (purchases, clicks, subscriptions, etc.) to avoid too many false positives. Only then can you be reasonably confident that one option (A or B) really is better than the other, and that you should indeed listen to the A/B test result.
Cutting the duration of your A/B test, or the number of conversions it collects, dramatically increases the proportion of false positives in your results.
How many events do you really need?
If you are very early stage, A/B testing might just not be the right method for you. You can still gather incredible data by talking to your users!
A simple rule of thumb is to use 6000 conversion events in each group if you want to detect a 5% uplift. Use 1600 if you only want to detect uplifts of over 10%. These numbers will give you approximately an 80% chance to detect a true effect.
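To see where numbers of this size come from, here is a minimal sketch based on the textbook normal-approximation sample size for comparing two conversion rates. The 5% baseline conversion rate, the 5% significance level and the function name `conversions_needed` are assumptions made for illustration; they are not taken from the original piece, so the output only lands in the same ballpark as the rule of thumb.

```python
import math
from statistics import NormalDist


def conversions_needed(uplift, baseline_rate=0.05, alpha=0.05, power=0.80):
    """Approximate conversions per group needed to detect a relative uplift.

    Uses the standard normal-approximation sample size for a two-sided
    comparison of two proportions, with both groups' variance taken at the
    baseline rate. baseline_rate, alpha and power are assumed values.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    absolute_effect = baseline_rate * uplift
    variance = baseline_rate * (1 - baseline_rate)
    visitors = 2 * (z_alpha + z_power) ** 2 * variance / absolute_effect ** 2
    # Convert visitors per group into expected conversions per group.
    return math.ceil(visitors * baseline_rate)


for uplift in (0.05, 0.10):
    print(f"{uplift:.0%} uplift: ~{conversions_needed(uplift)} conversions per group")
```

Halving the uplift you want to detect roughly quadruples the conversions you need, which is why small improvements are so expensive to confirm.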
Rule #2: don't underestimate how bad things can go
New variants of websites are overrated.
Data from Google suggests that 90% of the time, new variants have a negative effect or no effect at all on website performance (1).
Beware of hidden negative effects!
If you shorten your test so that it can only detect uplifts of over 10%, you will also miss negative effects of less than 10% on your conversion rate. You could cut your conversion rate quite significantly without realizing it.
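A rough way to quantify this is to compute the chance that a test sized for a 10% uplift would flag a true 5% drop. The sketch below uses a normal approximation for a two-sided z-test; the 5% baseline conversion rate, the 5% significance level and the helper name `detection_power` are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def detection_power(relative_change, conversions_per_group,
                    baseline_rate=0.05, alpha=0.05):
    """Approximate chance that a two-sided z-test flags a true relative change."""
    visitors = conversions_per_group / baseline_rate
    variant_rate = baseline_rate * (1 + relative_change)
    std_error = math.sqrt(
        baseline_rate * (1 - baseline_rate) / visitors
        + variant_rate * (1 - variant_rate) / visitors
    )
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(variant_rate - baseline_rate) / std_error
    # Probability of a significant result on the side of the true effect.
    # The tiny chance of a significant result in the wrong direction is ignored.
    return NormalDist().cdf(shift - z_alpha)


print(f"+10% uplift: {detection_power(0.10, 1600):.0%} chance of detection")
print(f"-5% drop:    {detection_power(-0.05, 1600):.0%} chance of detection")
```

Under these assumptions, a test with 1600 conversions per group catches a real 5% drop well under half the time.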
Don't stop tests as soon as you see positive results
This also dramatically increases the chance of a false positive. A majority of your results (2) could be bogus if you stop each test as soon as it shows a positive result.
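The effect of peeking can be illustrated with simulated A/A tests, where both groups have exactly the same conversion rate and every "winner" is a false positive by construction. This is a minimal sketch, not the simulation from the original piece: the 10% conversion rate, 20 peeks, 100 visitors per group between peeks and the 1.96 threshold are all assumed values.

```python
import math
import random


def z_score(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two groups of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    variance = pooled * (1 - pooled) * 2 / n
    if variance == 0:
        return 0.0
    return (conv_b / n - conv_a / n) / math.sqrt(variance)


def simulate_aa_test(rng, looks=20, batch=100, rate=0.10, threshold=1.96):
    """Run one A/A test. Returns (won_at_any_peek, won_at_final_look)."""
    conv_a = conv_b = n = 0
    won_at_any_peek = False
    z = 0.0
    for _ in range(looks):
        conv_a += sum(rng.random() < rate for _ in range(batch))
        conv_b += sum(rng.random() < rate for _ in range(batch))
        n += batch
        z = z_score(conv_a, conv_b, n)
        if z > threshold:  # B looks like a significant winner at this peek
            won_at_any_peek = True
    return won_at_any_peek, z > threshold


def false_positive_rates(n_tests=300, seed=0):
    """Share of A/A tests declared winners when peeking vs. looking once."""
    rng = random.Random(seed)
    results = [simulate_aa_test(rng) for _ in range(n_tests)]
    peeking = sum(peek for peek, _ in results) / n_tests
    fixed = sum(final for _, final in results) / n_tests
    return peeking, fixed


peeking_rate, fixed_rate = false_positive_rates()
print(f"stop at first positive peek: {peeking_rate:.1%} false winners")
print(f"single look at the end:      {fixed_rate:.1%} false winners")
```

Every test that wins at the final look would also have been stopped at some peek, so peeking can only add false winners, never remove them.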
The novelty effect
Any winning test whose results tend to decrease over time indicates a potential false positive. If the results deteriorate over time, the uplift may never have been there to start with. It could be some kind of novelty effect, or simply a statistical phenomenon called regression to the mean.
Example: If you give 100 students a test that they don't know the answers to, their answers will be completely random. If you take the "winning" 10% of the students and retest them, chances are their results on the second test will deteriorate, because their first scores were pure luck. The same pattern holds for the 3rd, 4th, 5th tests and so on.
To avoid this, let your tests run long enough (we talked about that) and ideally, try to perform confirmatory tests for the winning A/B tests.
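The student example above can be simulated in a few lines. This is only a sketch: the 20-question, 4-choice test format and the function names are assumptions, chosen so that every answer is a pure guess.

```python
import random
from statistics import mean


def random_score(rng, questions=20, choices=4):
    """Score of a student who guesses every multiple-choice question."""
    return sum(rng.randrange(choices) == 0 for _ in range(questions))


def retest_top_students(n_students=100, n_top=10, seed=1):
    """Return the winners' mean score on test 1 and on the retest."""
    rng = random.Random(seed)
    first = [random_score(rng) for _ in range(n_students)]
    # Pick the "winners": the best-scoring students on the first test.
    ranked = sorted(range(n_students), key=lambda i: first[i], reverse=True)
    top = ranked[:n_top]
    # Retest only the winners; their new scores are fresh random guesses.
    second = [random_score(rng) for _ in top]
    return mean(first[i] for i in top), mean(second)


first_mean, second_mean = retest_top_students()
print(f"winners on test 1:       {first_mean:.1f} / 20")
print(f"same students on test 2: {second_mean:.1f} / 20")
```

The winners were selected because they got lucky, so on the retest their average falls back toward the 5 out of 20 that guessing yields on average. A confirmatory A/B test plays the role of the retest.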
The winner's curse
Another bias is that we tend not to declare a test a winner unless the observed uplift is large. This causes winning A/B tests to overestimate the uplift in general: when the observed uplift is modest, we are less likely to declare the test a "winner", so the winners we do report are skewed toward lucky, inflated results.
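A small simulation makes the winner's curse visible. In this sketch every test has the same true uplift, but only the tests that clear the significance bar are kept. The 10% baseline conversion rate, the true 5% uplift, the 2000 visitors per group and the function names are assumed values for illustration.

```python
import math
import random
from statistics import mean


def run_test(rng, n=2000, base_rate=0.10, true_uplift=0.05):
    """Simulate one A/B test; return (observed relative uplift, z statistic)."""
    conv_a = sum(rng.random() < base_rate for _ in range(n))
    conv_b = sum(rng.random() < base_rate * (1 + true_uplift) for _ in range(n))
    rate_a, rate_b = conv_a / n, conv_b / n
    pooled = (conv_a + conv_b) / (2 * n)
    std_error = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return (rate_b - rate_a) / rate_a, (rate_b - rate_a) / std_error


def winners_uplift(n_tests=200, true_uplift=0.05, seed=2):
    """Observed uplifts of the tests that were declared significant winners."""
    rng = random.Random(seed)
    tests = [run_test(rng, true_uplift=true_uplift) for _ in range(n_tests)]
    # Only tests that clear the significance bar get declared "winners".
    return [uplift for uplift, z in tests if z > 1.96]


winners = winners_uplift()
print("true uplift: 5.0%")
if winners:
    print(f"{len(winners)} winners, mean observed uplift: {mean(winners):.1%}")
```

With groups this small, a test can only reach significance when the observed uplift is several times larger than the true one, so the declared winners systematically overstate the effect.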
Rule #3: focus
There is a big chance that even a positive result won't produce a real uplift on your product. It is very time-consuming and confusing to run multiple tests, let alone all at once.
A good experiment should be focused and researched, with a specific actionable outcome. It should also feed into the broader product vision, in order to really move the needle and justify the resources it will use (think about the opportunity cost, especially if you are early stage).
As Fareed Mosavat, a former Director of Product at Slack, puts it in this great article: Businesses aren’t built on optimizations.
Sources & details
(1) Source: Jim Manzi, Uncontrolled: The Surprising Payoff of Trial-and-Error for Business, Politics, and Society (2012).
(2) The authors of the article this post is based on found a 41% false positive rate. If we assume 10% of variants have a real effect, at most 10 of 100 tests will be true positives. Out of 100 tests, on average 41 will therefore be false positives and at most 10 will be true positives. Of 51 winning tests, over 80% will actually be false.
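The arithmetic in footnote (2) can be reproduced with a short sketch. The function name and its parameters are hypothetical. The `nulls_only` option is an added assumption, not part of the footnote: it applies the false positive rate only to the variants that have no real effect, which is a stricter reading of "false positive rate".

```python
def false_share_of_winners(n_tests=100, false_positive_rate=0.41,
                           real_effect_share=0.10, power=1.0,
                           nulls_only=False):
    """Share of declared winners that are false positives.

    nulls_only=False mirrors the footnote: the false positive rate is applied
    to all tests. nulls_only=True applies it only to variants with no real
    effect. power=1.0 means every real effect is detected ("at most 10").
    """
    real = n_tests * real_effect_share
    true_positives = real * power
    null_tests = n_tests - real if nulls_only else n_tests
    false_positives = null_tests * false_positive_rate
    return false_positives / (false_positives + true_positives)


print(f"footnote arithmetic: {false_share_of_winners():.1%} of winners are false")
print(f"nulls-only variant:  {false_share_of_winners(nulls_only=True):.1%}")
```

Either way, roughly four out of five winning tests are false under these assumptions, and any test power below 100% pushes that share higher.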