An article from June 2014 on how Optimizely and others only ran one-tailed tests, which oversimplifies the conclusions.
Entertaining article, here are some take-aways:
- Run a two-tailed test (to check for a significant effect in the opposite direction as well)
- “The short answer is that with a two-tailed test, you are testing for the possibility of an effect in two directions, both the positive and the negative. One-tailed tests, meanwhile, allow for the possibility of an effect in only one direction, while not accounting for an impact in the opposite direction.”
- Run several tests (to see whether the same result holds each time)
- Run tests longer (to get more variety in the users)
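The one-tailed vs two-tailed distinction can be made concrete with a two-proportion z-test. A minimal sketch (function name and sample numbers are illustrative, not from the article):

```python
from math import sqrt, erfc

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: returns z plus one- and two-tailed p-values."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    sf = lambda x: 0.5 * erfc(x / sqrt(2))            # normal survival function
    p_one = sf(z)                                     # only "B better than A"
    p_two = 2 * sf(abs(z))                            # a difference either way
    return z, p_one, p_two

# 200/1000 vs 235/1000 conversions: the one-tailed p-value clears 0.05,
# the two-tailed one does not -- exactly the "simplified" conclusion above.
z, p_one, p_two = two_proportion_z(200, 1000, 235, 1000)
```

The two-tailed p-value is twice the one-tailed one here, which is why a one-tailed test makes it easier to declare a winner.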
Some of the Hacker News comments (from January 2016) are interesting too, one arguing that quantifying the uncertainty is essential:
“In my view, the issue is not one-tail vs two-tail tests, or sequential vs one-look tests at all. The issue is a failure to quantify uncertainty.
Optimizely (last time I looked), our old reports, and most other tools, all give you improvement as a single number. Unfortunately that’s BS. It’s simply a lie to say “Variation is 18% better than Control” unless you had Facebook levels of traffic. An honest statement will quantify the uncertainty: “Variation is between -4.5% and +36.4% better than Control”.
“We just report credible intervals. We find that to be the only honest choice.”
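The commenter’s point can be sketched with a credible interval for the relative lift, via Monte Carlo draws from Beta posteriors (a sketch assuming uniform Beta(1,1) priors; function name and numbers are illustrative, not how Optimizely actually computes its intervals):

```python
import random

def lift_credible_interval(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """95% credible interval for the relative lift of B over A."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(draws):
        # Posterior conversion rates under Beta(1,1) priors
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        lifts.append((rate_b - rate_a) / rate_a)      # relative lift
    lifts.sort()
    return lifts[int(0.025 * draws)], lifts[int(0.975 * draws)]

# 200/1000 vs 230/1000: the interval is wide and straddles zero, so a
# single-number "15% better than Control" claim would hide the uncertainty.
lo, hi = lift_credible_interval(200, 1000, 230, 1000)
```

Reporting the whole interval makes it obvious when the data cannot rule out the variation being worse than control.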
Another point worth raising is the impact on the Lifetime Value (LTV) of the user. Yes, we got a higher CTR in the test, but what is the impact on LTV? Are the right kind of users clicking on that button?