"Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing" Book review and Notes.
Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing was written by Ron Kohavi, Diane Tang, and Ya Xu, experimentation leaders at Airbnb, Microsoft, Amazon, LinkedIn, and Google. In the book you will find everything you need for your A/B tests:
How to improve the way your organization makes data-driven decisions?
How to correctly define success metrics for your experiments?
How to avoid common pitfalls?
How to evaluate hypotheses with statistical methods?
How to build a scalable experiment platform?
And you will find a lot of practical, real-world examples.
Is it worth reading?
Definitely yes. It uncovers a lot of A/B testing nuances. This book is gold and a must-read for any professional who works with A/B tests (Product Manager, UX/UI Designer, Product Analyst, Data Analyst, CRO Expert, CEO, CTO, etc.).
Below I will share some insights and examples from it, chapter by chapter.
Ch1: Introduction and Motivation.
One accurate measurement is worth more than a thousand expert opinions (c)
This chapter introduces the necessary background on A/B tests along with some example experiments:
How a small change in the way ad headlines were displayed brought an additional $100M/year to Bing Search. This experiment was delayed for months because nobody believed in it. The authors also briefly discuss the importance of choosing the right metric: looking at revenue alone is not enough. You also need to measure user experience, because the goal is to give users value so that they will use the search engine again.
Google/Bing Ads examples, Malware reduction, Backend experiments, and much more.
Terminology: OEC (Overall Evaluation Criterion), Parameter, Variant, Randomization Unit, Sample Size, Control, Treatment, etc.
Ch2: Running and Analyzing Experiments.
The fewer the facts, the stronger the opinion (c)
This chapter introduces an end-to-end example and the basic principles of designing, running, and analyzing an experiment:
Examples. I liked the fake door (or painted door) approach. Case: the marketing department wanted to increase sales by sending discount promo codes to their users. But they were concerned that adding a coupon code field to checkout would decrease revenue even when no coupons existed: some users might leave to search for promo codes on the internet, slow down, or abandon the purchase altogether. So they simply added a coupon field to the checkout page, with no actual coupons, to test whether it distracts users and hurts the metrics (do users interact with the field, do they abandon).
How to choose the right metric? Is it OK to take revenue, or is revenue-per-user better?
Hypothesis testing: Establishing statistical significance (sample size, p-value, confidence intervals, standard errors, statistical power).
Some hints for detecting a smaller change or being more confident in the experiment results: use a purchase indicator (did a user purchase or not) instead of revenue-per-user so the standard error is smaller, increase the sample size, or try to detect a larger difference in the KPI, etc.
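The statistics behind these hints can be sketched with a small, self-contained large-sample z-test. This is my own illustration, not code from the book; the helper name and the 1.96 critical value (for a ~95% confidence interval) are assumptions of the sketch:

```python
import math

def two_sample_ztest(control, treatment):
    """Large-sample z-test for the difference in means (e.g. revenue-per-user)."""
    na, nb = len(control), len(treatment)
    ma, mb = sum(control) / na, sum(treatment) / nb
    va = sum((x - ma) ** 2 for x in control) / (na - 1)     # sample variances
    vb = sum((x - mb) ** 2 for x in treatment) / (nb - 1)
    se = math.sqrt(va / na + vb / nb)       # standard error of the difference
    z = (mb - ma) / se
    # two-sided p-value from the standard normal CDF
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    ci = (mb - ma - 1.96 * se, mb - ma + 1.96 * se)         # ~95% CI
    return z, p, ci

# A 0/1 purchase indicator has much smaller variance than raw revenue,
# so the same lift is detected with higher confidence.
control = [0] * 900 + [1] * 100      # 10% of 1000 users purchased
treatment = [0] * 850 + [1] * 150    # 15% of 1000 users purchased
z, p, ci = two_sample_ztest(control, treatment)
```

With these numbers the confidence interval for the lift excludes zero, so the difference is statistically significant at the 5% level.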
Before analyzing the test, it's crucial to check sanity metrics, such as equal sample sizes in the Control and Treatment groups or latency, to verify that the experiment was launched correctly and without bugs.
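The sample-size sanity check is usually formalized as a sample ratio mismatch (SRM) test. A minimal sketch, assuming a 50/50 split design; the helper name and thresholds are my illustration, not the book's code:

```python
import math

def srm_pvalue(n_control, n_treatment, expected_ratio=0.5):
    """Chi-square goodness-of-fit test (1 degree of freedom) for SRM."""
    total = n_control + n_treatment
    exp_c = total * expected_ratio
    exp_t = total - exp_c
    chi2 = ((n_control - exp_c) ** 2 / exp_c
            + (n_treatment - exp_t) ** 2 / exp_t)
    # with 1 df, chi2 is the square of a standard normal variable,
    # so the p-value follows from the normal CDF
    zstat = math.sqrt(chi2)
    return 2 * (1 - 0.5 * (1 + math.erf(zstat / math.sqrt(2))))

# A 50000 / 50000 split is fine; 50000 / 51000 is a red flag:
# the assignment itself is likely buggy, so don't trust the metric results.
```

A tiny p-value here means the randomization is broken and the experiment results should be discarded, no matter how good they look.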
From results to decisions. You need to take into account both the conclusions from the measurement and the broader context: tradeoffs between different metrics, the cost of fully building out the feature before launch, and the cost of ongoing engineering maintenance after launch.
Examples for understanding statistical and practical significance when making launch decisions.
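Those launch-decision examples can be condensed into a rough heuristic that compares the effect's confidence interval against a practical-significance threshold. The function name, labels, and branch structure are my own sketch of the idea, not the book's procedure:

```python
def launch_decision(ci_low, ci_high, practical_min):
    """Classify an experiment result for a launch decision.

    (ci_low, ci_high): confidence interval for the treatment effect.
    practical_min: the smallest improvement worth launching.
    """
    stat_sig = ci_low > 0 or ci_high < 0    # CI excludes zero
    if ci_high < 0:
        return "do not launch: the change hurts the metric"
    if stat_sig and ci_low >= practical_min:
        return "launch"
    if stat_sig and ci_high < practical_min:
        return "do not launch: change too small to matter"
    if not stat_sig and ci_high < practical_min:
        return "do not launch: no meaningful effect"
    return "underpowered: consider a follow-up with more traffic"
```

For example, a result that is statistically significant but whose entire confidence interval sits below the practical threshold is a clear "no launch", while a wide interval straddling both zero and the threshold calls for a bigger follow-up experiment.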
Ch3: Twyman's Law and Experimentation Trustworthiness
Twyman's Law: "Any statistic that appears interesting is almost certainly a mistake." (c)
Here they write about:
the misinterpretation of the statistical results like lack of statistical power, misinterpreting p-values, peeking at p-values, multiple hypothesis tests, confidence intervals;
threats to internal validity: violations of SUTVA, survivorship bias, intention-to-treat, sample ratio mismatch (SRM);
threats to external validity: primacy effects and novelty effects;
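The multiple-hypothesis-testing pitfall listed above has a standard first-line remedy, the Bonferroni correction. A minimal sketch (the book discusses the pitfall; the helper here is my illustration of one common fix):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """With m tests, compare each p-value to alpha / m instead of alpha.

    This controls the family-wise error rate: testing many metrics or
    many variants at alpha = 0.05 each otherwise guarantees spurious
    'winners' by chance alone.
    """
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Three metrics tested at once: only the first survives correction,
# since each p-value is now compared against 0.05 / 3.
flags = bonferroni_significant([0.004, 0.03, 0.2])
```

Bonferroni is conservative; the point is simply that an uncorrected 0.03 on one of many metrics is not the evidence it appears to be.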
Good data scientists are skeptics: they look at anomalies, they question results, and they invoke Twyman's Law when the results look too good. (c)
Ch4: Experimentation Platform and Culture
If you have to kiss a lot of frogs to find a prince, find more frogs and kiss them faster and faster (c).
Other interesting thoughts for me:
The slowdown experiment design. A page loads in chunks, so it may matter which part you intentionally slow down. You may want to slow down only essential elements (like the search bar) and not care about side elements.
Goal metrics (success metrics or north star metrics). Driver metrics (indirect, predictive metrics; leading indicators for goal metrics). Guardrail metrics (metrics that protect the business or assess the trustworthiness and internal validity of experiment results).
You can buy this amazing book on Amazon.