Ron Kohavi’s Post

Vice President and Technical Fellow | Data Science, Engineering | AI, Machine Learning, Controlled Experiments | Ex-Airbnb, Ex-Microsoft, Ex-Amazon

Is there p-hacking in e-commerce A/B testing? A new paper by Alex P. Miller and Kartik Hosanagar (https://lnkd.in/gCBgfWh8) claims they can’t find evidence of it based on 2,270 experiments conducted by 242 firms.

This is very different from Ron Berman et al.’s prior paper, which claimed heavy p-hacking. Key difference: the prior paper looked at Optimizely data from 2014, when the platform encouraged p-hacking (as a feature), but that statistically naïve “feature” was fixed in 2015 (see https://lnkd.in/gbVWtXh and Peter Bordens' post on how he was almost fired: https://lnkd.in/gF9k3vBk). The new paper uses data from a different (unnamed, but “large U.S.-based”) vendor.

P-hacking is the intentional or unintentional misapplication of statistics to achieve statistically significant results using human degrees of freedom, such as ending experiments early, handling outliers post hoc, looking at segments, etc. It used to be a big problem 10 years ago; awareness has since risen, and while I think it still occurs (mostly unintentionally), it is much less frequent today.

Still, I strongly recommend everyone run A/A tests (see Chapter 19 of https://lnkd.in/eWuqBVw) and look at the actual distribution of p-values from your experimentation platform to see if there’s an unreasonable discontinuity around alpha (usually 0.05).

Want to learn more about A/B testing and trust? I teach an interactive 10-hour course (https://bit.ly/ABClassRKLI) and an advanced course (https://lnkd.in/gU9xrezE).

#abtesting #twymansLaw #AATest #peeking

Leonid Pekelis Aisling Scott, Ph.D. Christophe Van den Bulte Uri Simonsohn Ulrich Schimmack
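For readers who want to try the p-value check, here is a minimal sketch (my illustration, not from the paper or the post). It assumes your platform can export one final p-value per experiment; here they are simulated with A/A tests, where p-values should be roughly uniform on [0, 1], so the counts just below and just above 0.05 should be similar:

import numpy as np
from scipy import stats

def discontinuity_check(p_values, alpha=0.05, width=0.01):
    # Count p-values in equal-width windows just below and just above alpha.
    p = np.asarray(p_values)
    below = int(np.sum((p >= alpha - width) & (p < alpha)))
    above = int(np.sum((p >= alpha) & (p < alpha + width)))
    return below, above

# Simulate 2,000 A/A tests: both arms draw from the same distribution.
rng = np.random.default_rng(42)
p_values = []
for _ in range(2000):
    a = rng.normal(0.0, 1.0, 1000)
    b = rng.normal(0.0, 1.0, 1000)  # identical distribution: a true A/A test
    p_values.append(stats.ttest_ind(a, b).pvalue)

below, above = discontinuity_check(p_values)
print(f"p-values just below 0.05: {below}; just above 0.05: {above}")

A pronounced excess just below 0.05 relative to just above it, on your real experiment data, is the discontinuity signature the post describes.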

Gareth Wilson

Senior Product Manager

1mo

Hi Ron Kohavi, you mentioned that pre-2015 Optimizely encouraged p-hacking. As noted in the article, the platform studied here (unnamed) also doesn’t explicitly advise stopping tests only once the target sample size is reached. The article also states it moves metrics to a separate section as soon as they hit statistical significance. How does this differ from what Optimizely was doing, in terms of discouraging p-hacking? It seems like this platform could also be subtly encouraging p-hacking.

Kartik Hosanagar

AI, Entrepreneurship, Digital Transformation, Mindfulness. Wharton professor. Cofounder Yodle, Jumpcut

1mo

Thanks, Ron, for sharing our paper. I believe you saw a prior conference version of this paper many years ago; it took us some time to get it out in print. At some level, it's natural for untrained testers to fall into the trap of looking for statistical significance rather than the truth. It was a pleasant surprise to find a null result in our meta-analysis (and our data are from a similar period as the Berman paper). As an aside, I like your definition: "P-hacking is intentional or unintentional misapplication of statistics to achieve statistically significant results." We once had a referee strongly object to the use of the term for unintentional misuse; it took us a lot of effort to convince them. Keep up your posts on testing. I enjoy them.

Sarnath Kannan

Staff data scientist, MAFer, Georgia Tech Alumni

1mo

Discontinuity around alpha -- that's a great idea! Thanks!!!

Leo Murillo

Software Engineering Manager at Amazon Fulfillment Technologies

1mo

Saved

Michael Hughes

Growth @ Eclipse | Helping Ecommerce brands grow through experimentation

3w

These concepts will always be difficult for those only lightly involved with experimentation to fully grasp. Like Ron Kohavi, we find that running an A/A test is our preferred way of illustrating how a result matures as the sample size grows.
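To make that concrete, here is a hypothetical sketch (my addition, not Michael's) that tracks the running p-value of a single simulated A/A test: with no true difference between the arms, the p-value still wanders and often dips below 0.05 at some interim peek, which is exactly why a result needs to mature to the planned sample size before it is read:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 10_000)  # control
b = rng.normal(0.0, 1.0, 10_000)  # "treatment" drawn from the same distribution

# Peek every 100 users per arm and record the first time p drops below 0.05.
first_dip = None
for n in range(100, 10_001, 100):
    p = stats.ttest_ind(a[:n], b[:n]).pvalue
    if first_dip is None and p < 0.05:
        first_dip = (n, round(p, 3))

print("first interim peek with p < 0.05:", first_dip)  # often not None, by chance
print(f"p-value at the full sample: {stats.ttest_ind(a, b).pvalue:.3f}")

Stopping at that first dip would declare a winner in a test where nothing changed; waiting for the full sample usually brings the p-value back to an unremarkable value.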

Paulo Saavedra

UX Lead Consultant / Strategy and Management / Digital Products / Project Manager

1mo

This is great

Alex P. Miller

Assistant Professor, USC Marshall School of Business

1mo

Thanks for posting, Ron! Our data come from a similar time period (2014-2016). I think you're right to highlight that this was a bigger deal back then, but maybe that makes our results all the more surprising!
