Is there p-hacking in e-commerce A/B testing? A new paper by Alex P. Miller and Kartik Hosanagar (https://lnkd.in/gCBgfWh8) finds no evidence of it based on 2,270 experiments conducted by 242 firms. This is very different from Ron Berman et al.'s prior paper, which claimed heavy p-hacking.

Key difference: the prior paper looked at Optimizely data from 2014, when the platform encouraged p-hacking (as a feature), but that statistically naïve "feature" was fixed in 2015 (see https://lnkd.in/gbVWtXh and Peter Borden's post on how he was almost fired: https://lnkd.in/gF9k3vBk). The new paper uses data from a different (unnamed, but "large U.S.-based") vendor.

P-hacking is the intentional or unintentional misapplication of statistics to achieve statistically significant results by exploiting researcher degrees of freedom, such as ending experiments early, selectively removing outliers, looking at segments, etc. It was a big problem ten years ago; awareness has risen, and while I think it still occurs (mostly unintentionally), it is much less frequent today.

Still, I strongly recommend everyone run A/A tests (see Chapter 19 of https://lnkd.in/eWuqBVw) and look at the actual distribution of p-values from your experimentation platform to see if there's an unreasonable discontinuity around alpha (usually 0.05).

Want to learn more about A/B testing and trust? I teach an interactive 10-hour course: https://bit.ly/ABClassRKLI and an advanced course: https://lnkd.in/gU9xrezE

#abtesting #twymansLaw #AATest #peeking Leonid Pekelis Aisling Scott, Ph.D. Christophe Van den Bulte Uri Simonsohn Ulrich Schimmack
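To make the A/A-test and discontinuity check concrete, here is a minimal Python sketch (my own illustration, not from either paper; the conversion rate, sample sizes, and choice of test are arbitrary assumptions): simulate many A/A tests, where p-values should be roughly uniform under the null, and compare the mass just below alpha to the mass just above it.

```python
# Minimal sketch: simulate A/A tests and inspect the p-value distribution near alpha.
# Under the null, p-values are ~uniform on [0, 1]; a pile-up just below 0.05
# relative to just above it is the kind of discontinuity worth investigating.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def aa_test_pvalue(n_per_arm=10_000, conversion_rate=0.05):
    """Run one A/A test on binary conversions; both arms share the same true rate."""
    control = rng.binomial(1, conversion_rate, n_per_arm)
    treatment = rng.binomial(1, conversion_rate, n_per_arm)
    # A t-test on 0/1 outcomes is a reasonable approximation at this sample size.
    _, p = stats.ttest_ind(control, treatment)
    return p

pvals = np.array([aa_test_pvalue() for _ in range(2_000)])

# Roughly 5% of A/A tests should be "significant" at alpha = 0.05.
print(f"fraction with p < 0.05: {np.mean(pvals < 0.05):.3f}")

# Discontinuity check: compare mass just below vs. just above alpha.
just_below = np.mean((pvals >= 0.04) & (pvals < 0.05))
just_above = np.mean((pvals >= 0.05) & (pvals < 0.06))
print(f"mass in [0.04, 0.05): {just_below:.3f}  vs. [0.05, 0.06): {just_above:.3f}")
```

In a healthy platform the two bins should be about equal, and only about 5% of A/A tests should come out "significant"; a large imbalance around 0.05 is a red flag.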
Thanks, Ron, for sharing our paper. I believe you have seen a prior conference version of this paper many years ago; it took us some time to get it out in print. At some level, it's natural for untrained testers to fall into the trap of looking for statistical significance rather than the truth. It was a pleasant surprise to find a null result in our meta-analysis (and our data are from a similar period as the Berman paper). As an aside, I like your definition: "P-hacking is intentional or unintentional misapplication of statistics to achieve statistically significant results." We once had a referee strongly object to the use of the term for unintentional misuse; it took us a lot of effort to convince them. Keep up your posts on testing. I enjoy them.
Discontinuity around alpha -- that's a great idea! Thanks!!!
Saved
These concepts will always be difficult for those only lightly involved with experimentation to fully grasp. Like Ron Kohavi, we find that running an A/A test is our preferred way of illustrating how a result matures as the sample size grows.
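A minimal sketch of that illustration, assuming a simple simulated A/A test with binary conversions (the rate and checkpoints are arbitrary): recompute the p-value as traffic accumulates and watch how wildly the early readings swing before the result matures.

```python
# Minimal sketch: one A/A test, re-analyzed at increasing sample sizes.
# There is no true difference, yet early readings can look "promising" by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
rate = 0.05                                  # same conversion rate in both arms (A/A)
control = rng.binomial(1, rate, 50_000)
treatment = rng.binomial(1, rate, 50_000)

for n in [500, 2_000, 5_000, 10_000, 25_000, 50_000]:
    _, p = stats.ttest_ind(control[:n], treatment[:n])
    print(f"n per arm = {n:>6}: p = {p:.3f}")
```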
This is great.
Thanks for posting, Ron! Our data come from a similar time period (2014-2016). I think you're right in highlighting that this was a bigger deal back then, but maybe that makes our results all the more surprising!
Hi Ron Kohavi, you mentioned that pre-2015 Optimizely encouraged p-hacking. As noted in the article, the platform studied here (unnamed) also doesn't explicitly advise stopping tests only once the planned sample size is reached. The article also states that it moves metrics to a separate section as soon as they hit statistical significance. How does this differ from what Optimizely was doing, in terms of discouraging p-hacking? It seems like this platform could also be subtly encouraging p-hacking.
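For intuition on why surfacing a metric the moment it first crosses significance can act like implicit peeking, here is a hedged simulation sketch (my assumption about the mechanism being discussed, not the platform's actual logic; rates, peek schedule, and sample sizes are arbitrary): declaring a "winner" at the first peek where p < 0.05 inflates the A/A false positive rate well above the nominal 5%.

```python
# Minimal sketch: false positive rate when an A/A test is "stopped" (or flagged)
# the first time any peek shows p < 0.05, versus the nominal alpha of 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rate, n_max, checks = 0.05, 20_000, 40       # peek every n_max/checks users per arm

def peeking_false_positive():
    control = rng.binomial(1, rate, n_max)
    treatment = rng.binomial(1, rate, n_max)
    for n in np.linspace(n_max // checks, n_max, checks, dtype=int):
        _, p = stats.ttest_ind(control[:n], treatment[:n])
        if p < 0.05:
            return True                      # flagged as significant at some peek
    return False

sims = 1_000
fp = sum(peeking_false_positive() for _ in range(sims)) / sims
print(f"false positive rate with peeking: {fp:.2f} (nominal alpha = 0.05)")
```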