A/B Test Checklist - Issue 72
A short guide to product experimentation steps and must-know terminology.
Hello analysts! This week I wanted to continue the theme of Statistics and talk again about the never-ending topic of A/B tests.
There are so many good materials already written on running A/B tests, and yet it can be so difficult to search through them to find the right explanations for basic questions. But I’m here to help! Last year I published the A/B One Pager. This is a brief guide covering only the essential information you’ll need - steps, concepts, and must-know terminology. It’s aimed more towards junior analysts who are getting started with experimentation. I ended up using it quite a bit as a checklist - a resource to quickly pull from for things like significance calculators.
Today I want to “upgrade” my A/B One Pager to include more gotchas, context, and analytics. As I keep saying, you don’t mess with statistics. There’s a lot to add and plenty to remember.
So keep that list close and reread as needed.
✍️ Steps for conducting a product experiment:
1. Can you test it?
You can’t A/B test every little thing. New experiences or brand-new product releases can’t always be run through an A/B test (read - How To Measure Product Adoption). Potential biases - novelty effect and change aversion.
2. Formulate a hypothesis
Why do you have to run the experiment? What is the ROI? Is it a good time to run the test? Consider seasonality, new version releases, open bugs, etc.
Set the smallest lift you’d care about - this is your Minimum Detectable Effect (MDE). Why do you need an MDE? It’s the smallest difference between the Control and the Variant worth acting on. If the Variant is 0.0001% better than the Control, would you still want to run the test? Is it worth the cost and time?
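To make the MDE concrete, here is a tiny sketch that converts a relative MDE into the absolute rate the Variant would need to hit. The baseline and MDE numbers are illustrative assumptions, not recommendations:

```python
# Illustrative numbers only - plug in your own baseline and MDE.
baseline_rate = 0.20   # Control converts at 20%
relative_mde = 0.10    # the smallest relative lift worth acting on (10%)

absolute_mde = baseline_rate * relative_mde   # 0.02, i.e. 2 percentage points
target_rate = baseline_rate + absolute_mde    # 0.22

print(f"The Variant only matters to us if it converts at >= {target_rate:.0%}")
```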
3. Finalize your set of metrics
For A/B analysis, I use a set of 3 metrics:
Success metrics
Ecosystem metrics (company KPIs)
Tradeoff metrics
More described here - How To Pick The Right Metric.
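As a hypothetical illustration (the metric names below are invented for a checkout test, not a required schema), writing the three-metric set down explicitly helps everyone agree on it before the test starts:

```python
# Hypothetical metric set for a checkout experiment - names are examples only.
experiment_metrics = {
    "success":   ["checkout_conversion_rate"],           # what the change should move
    "ecosystem": ["weekly_active_users", "revenue"],      # company KPIs that must not regress
    "tradeoff":  ["page_load_time", "support_tickets"],   # costs we accept only up to a point
}
```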
4. Calculate sample size:
Set your significance level, confidence level, and statistical power.
Your experiment groups should be the same size.
Your sample should be randomly distributed. Account for traffic source, device, returning users, etc. Work with the engineering team on testing and ensure that the randomization algorithm works as expected (hashing, clustering, sample stratification?).
Make sure there is no bias introduced with other tests running.
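A minimal sketch of this calculation, using the standard two-proportion normal-approximation formula and only the Python standard library (the baseline, MDE, alpha, and power values are example assumptions):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided, equal-split
    two-proportion test (normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # smallest Variant rate worth detecting
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# Example assumptions: 20% baseline conversion, 10% relative MDE
print(sample_size_per_group(0.20, 0.10))   # roughly 6,500 users per group
```

Online calculators (like the ones linked further down) take the same inputs, so a sketch like this is mainly useful for sanity-checking them or scripting the calculation.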
5. Run the test
until you reach the sample size and duration you planned. Monitor the test timeline and events along the way, but avoid stopping early just because results briefly look significant.
6. Evaluate results:
Run sanity checks. Control metrics and conversions should match the Baseline; if they don’t, question the test setup (see the sketch after this list).
Check sample variance and distribution.
Run spot checks. Pick a few users from Control and Variant samples and check them to ensure they are random, not overlapping with other tests, and are meeting the test requirements.
If the result is not what you expected, think of the potential bias - novelty effect, learning effect, network effect.
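A minimal sketch of two of these evaluation checks - a sample ratio mismatch (SRM) sanity check and a two-proportion z-test for significance - using only the Python standard library; the counts are invented example data:

```python
from math import sqrt
from statistics import NormalDist

def srm_check(n_control, n_variant):
    """Sanity check for a planned 50/50 split: chi-square with 1 degree of
    freedom; 3.84 is roughly the 95% critical value."""
    expected = (n_control + n_variant) / 2
    chi2 = ((n_control - expected) ** 2 + (n_variant - expected) ** 2) / expected
    return chi2 < 3.84   # False -> the split itself looks broken, question the setup

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test on conversion counts; returns (z, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Invented example counts, not real experiment data:
print(srm_check(10_000, 10_150))
print(two_proportion_z_test(2_000, 10_000, 2_180, 10_150))
```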
7. Draw conclusions
and provide a recommendation on the next steps to product owners.
🔥 Things to remember:
Run the A/A test first. It helps you check the software, outside factors, and natural variance. You’ll need the sample variance to estimate the sample size required for your chosen significance level and statistical power (see the simulation sketch after this list).
Don’t pick metrics that are either too sensitive (views) or too robust (Day 7 or Day 30 retention). They are not helpful and tend to mislead you. The best test metric responds to the change you’re testing but doesn’t fluctuate much when unrelated events occur.
Don't run the experiment for too long, as you might experience data pollution - users showing up under multiple devices or cookies, and other outside factors contaminating your result.
Don’t run the experiment for too short a time either, or you might catch a false positive driven by regression to the mean - a metric that looks extreme at first and then settles back towards the average.
When introducing a new change, run the test on a smaller sample for a longer period of time to eliminate the novelty or learning effect bias.
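A minimal sketch of the A/A idea above: simulate many A/A runs where both groups share the same true conversion rate, and confirm that only about alpha (5%) of them come out “significant”. All numbers are illustrative assumptions:

```python
import random
from math import sqrt
from statistics import NormalDist

def simulated_aa_p_value(true_rate=0.20, n=5_000):
    """One simulated A/A test: both groups are drawn from the same rate,
    so any 'significant' difference is a false positive."""
    c1 = sum(random.random() < true_rate for _ in range(n))
    c2 = sum(random.random() < true_rate for _ in range(n))
    p1, p2, p_pool = c1 / n, c2 / n, (c1 + c2) / (2 * n)
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    return 2 * (1 - NormalDist().cdf(abs(p2 - p1) / se))

runs = 1_000
false_positives = sum(simulated_aa_p_value() < 0.05 for _ in range(runs))
# Expect roughly 5%; if your real A/A tests flag differences far more often,
# suspect the randomization or tracking, not the product.
print(f"{false_positives / runs:.1%} of A/A runs came out 'significant'")
```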
💣 Statistical terminology
To approach A/B testing, think of it as null-hypothesis testing and apply the following terms:
P-value - assuming Null-H is true, the probability of seeing a result at least as extreme as the one observed. If the p-value falls into the "not expected" region (below alpha), we reject Null-H.
Statistical Significance (or Significance level, alpha) is the probability of seeing an effect when none exists (false positive).
Statistical Power (or 1-beta) is the probability of detecting the effect when it does exist.
A Type I error is a false positive, or the rejection of a true null hypothesis.
A Type II error is a false negative, or the failure to reject a false null hypothesis.
A Confidence Interval is the range of values likely to contain the true metric value at your chosen confidence level: the narrower the CI, the more precise the estimate (see the sketch after this list).
z-score is the number of Standard Deviations from the mean.
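To make the CI and z-score terms concrete, here is a minimal sketch with invented counts, using the normal approximation for a conversion rate:

```python
from math import sqrt
from statistics import NormalDist

# Invented counts: 2,000 conversions out of 10,000 users.
conversions, users = 2_000, 10_000
p = conversions / users                    # observed conversion rate = 0.20
z = NormalDist().inv_cdf(0.975)            # z-score cutting off 2.5% per tail, ~1.96
margin = z * sqrt(p * (1 - p) / users)     # standard error scaled by the z-score
print(f"95% CI: [{p - margin:.3f}, {p + margin:.3f}]")   # narrower interval = more precise estimate
```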
🤔 If you are lost in conversions and numbers, check this guide:
If your baseline conversion is 20% and you set a relative MDE of 10%, the test is designed to detect a move to roughly 22% (or down to 18%) - anything inside that band is treated as noise (worked example after this list).
The higher your baseline conversion, the smaller the sample size you’ll need.
The smaller the MDE, the larger the sample you’ll need.
Low p-values are good when you’re looking for an effect: a p-value below 0.05 means a result this extreme would show up less than 5% of the time if there were truly no difference, so we reject the null hypothesis.
Your significance level could be 5% (i.e. 95% confidence) and your statistical power 80%.
It’s often recommended to run the experiment for 2 business cycles (2-4 weeks)
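Worked example tying these rules together (illustrative numbers, normal approximation): with a 20% baseline, a 10% relative MDE (so the test should catch a move to roughly 22%), a 5% significance level, and 80% power, the step 4 sketch lands at roughly 6,500 users per group; halving the MDE to 5% pushes that to roughly 25,000 per group.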
📢 Use this calculator or this one to determine the needed sample size for your experiment.
📢 Use this calculator to evaluate your test significance and result.
☠️ If your test outcome doesn’t make any sense, most likely one or more mistakes happened:
Testing all the things - applying too many changes and testing too many variables in one single test (a mistake at Step 2).
The Baseline metrics are off - if you didn’t define what you have to measure, you can’t quantify the test (a mistake at Step 3).
Not significant data - if you don’t have enough traffic, you are unlikely to run an effective A/B test (a mistake at Step 4).
🔍 Other types of product testing
Multivariate testing (MVT) - multiple variants and their combinations within a single test.
Split URL testing - multiple versions of your webpage posted on different URLs.
Multipage testing - testing changes across different pages. There is both Funnel Multi-Page testing and Conventional Multi-Page testing. Read more here.
Check out this guide if you want an A/B experiment checklist.
Thanks for reading, everyone. Until next Wednesday!