✍️ Steps for conducting a product experiment:
Can you test it?
You can’t A/B test every little thing. Brand-new experiences or major product releases are poor candidates for an A/B test (read - How To Measure Product Adoption) because they introduce potential bias - the novelty effect or change aversion.
Formulate a hypothesis
Why do you need to run this experiment? What is the expected ROI? Is it a good time to run the test? Consider seasonality, upcoming version releases, open bugs, etc.
Set the smallest lift you expect to detect - this is your Minimum Detectable Effect (MDE), the smallest difference between the Control and Variant that would still matter to you. If the Variant is only 0.0001% better than the Control, would you still want to run the test? Is it worth the cost and time?
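Here is a minimal sketch of turning a relative MDE into an absolute conversion target; the 20% baseline and 10% relative MDE are assumed example numbers, not taken from any real test.

```python
baseline = 0.20        # assumed current conversion rate
mde_relative = 0.10    # smallest relative lift worth detecting

mde_absolute = baseline * mde_relative          # 0.02, i.e. 2 percentage points
target_conversion = baseline + mde_absolute     # 0.22

print(f"Absolute MDE: {mde_absolute:.1%}")
print(f"Worth running only if the Variant can reach ~{target_conversion:.1%} conversion")
```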
Finalize your set of metrics
For A/B analysis, I use a set of three metric groups:
Success metrics
Ecosystem metrics (company KPIs)
Tradeoff metrics
More on this here - How To Pick The Right Metric.
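As an illustration only, here is one way such a metric set could be written down for a hypothetical checkout experiment; the metric names are invented, not taken from any real dashboard.

```python
# Hypothetical metric set for an imaginary checkout experiment
experiment_metrics = {
    "success":   ["checkout_conversion_rate"],            # what the change should move
    "ecosystem": ["weekly_active_users", "revenue"],       # company KPIs that must not regress
    "tradeoff":  ["support_tickets", "page_load_time_ms"], # costs the change might introduce
}

for group, metrics in experiment_metrics.items():
    print(f"{group}: {', '.join(metrics)}")
```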
Define your audience:
Is the change relevant to new users, active users, or everyone? Does it depend on region, platform, or language? If you work with mature analytics, do you need to limit test exposure to a specific persona?
The more user attributes and filters you add, the longer the test will likely run, because filtering reduces your sample size. On the other hand, it also reduces variance, so the result will be more precise.
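A minimal sketch of applying such filters to a hypothetical users table with pandas; the column names and values are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical user attributes; in practice this comes from your analytics store
users = pd.DataFrame({
    "user_id":  ["u1", "u2", "u3", "u4"],
    "platform": ["ios", "android", "ios", "web"],
    "region":   ["US", "US", "DE", "US"],
    "is_new":   [True, False, True, True],
})

# Each added filter shrinks the eligible sample (and the variance)
audience = users[(users["platform"] == "ios") & (users["region"] == "US") & users["is_new"]]
print(f"{len(audience)} of {len(users)} users are eligible for the test")
```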
Calculate sample size:
Set your significance level, confidence level, and statistical power (see the calculation sketch after this list).
Your experiment groups should be the same size.
Your sample should be randomly assigned. Account for traffic sources, devices, returning users, etc. Work with the engineering team to test and verify that the randomization algorithm works as expected (hashing, clustering, sample stratification?) - see the assignment sketch after this list.
Make sure no bias is introduced by other tests running at the same time.
Run the test until you reach significance and a little longer. Monitor the test timeline and events.
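Below is a minimal sample size sketch using statsmodels; the 20% baseline, 10% relative MDE, 5% significance level, and 80% power are assumed example inputs - swap in your own.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.20                    # assumed Control conversion rate
target = baseline * (1 + 0.10)     # 10% relative MDE -> 22% Variant conversion

effect_size = proportion_effectsize(baseline, target)   # Cohen's h for two proportions
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.80, alternative="two-sided"
)
print(f"~{round(n_per_group):,} users per group")
```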
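And a sketch of deterministic, hash-based assignment - one common way to keep the split random across users but stable for the same user; the salt, user id, and 50/50 split are placeholder values.

```python
import hashlib

def assign_variant(user_id: str, experiment_salt: str, split: float = 0.5) -> str:
    """Deterministically bucket a user: same user + salt always lands in the same group."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000   # map the hash to [0, 1)
    return "variant" if bucket < split else "control"

print(assign_variant("user_42", "checkout_test_v1"))   # stable across calls and sessions
```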
Evaluate results:
Run sanity checks. Control metrics and conversions should match the Baseline. If they don’t, question the test setup.
Check sample variance and distribution. High variance makes the result harder to trust.
Run spot checks. Pick a few users from the Control and Variant samples and verify that they were assigned randomly, don’t overlap with other tests, and meet the test requirements.
If the result is not what you expected, consider potential biases - the novelty effect, learning effect, or network effect.
Draw conclusions and give product owners a recommendation on the next steps.
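A minimal sketch of the significance check with statsmodels; the conversion counts below are made-up example numbers.

```python
from statsmodels.stats.proportion import proportions_ztest, proportion_confint

# Made-up results: [Control, Variant] conversions and sample sizes
conversions = [400, 460]
samples = [2000, 2000]

z_stat, p_value = proportions_ztest(conversions, samples)
ci_low, ci_high = proportion_confint(conversions[1], samples[1], alpha=0.05)

print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")            # reject the null if p < alpha
print(f"Variant conversion 95% CI: [{ci_low:.1%}, {ci_high:.1%}]")
```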
🔥 Things to remember:
Run an A/A test first. It helps you check the software, outside factors, and natural variance (see the simulation sketch after this list). You will need the sample variance to estimate the sample size for your chosen significance level and statistical power.
Don’t pick metrics that are either too sensitive (views) or too robust (Day 7 or Day 30 retention); they tend to mislead you. The best test metric responds to the change you are testing but doesn’t fluctuate much when unrelated events occur.
Don't run the experiment for too long, or you may run into data pollution - users switching devices, clearing cookies, and other outside factors contaminating your result.
Don’t run the experiment for too short a time either, or you might get a false positive due to regression to the mean - when a metric looks extreme at first but then moves closer to the average.
When introducing a new change, run the test on a smaller sample for a longer period of time to reduce novelty or learning effect bias.
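Here is a minimal sketch of an A/A check done by simulation: both groups get the same assumed 20% conversion, so roughly alpha of the runs should come out "significant"; all numbers are placeholders.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(7)
baseline, n, alpha, runs = 0.20, 2000, 0.05, 1000   # assumed example values

false_positives = 0
for _ in range(runs):
    a = rng.binomial(n, baseline)   # "Control" conversions - no real difference exists
    b = rng.binomial(n, baseline)   # "Variant" conversions - same underlying rate
    _, p = proportions_ztest([a, b], [n, n])
    false_positives += p < alpha

print(f"False positive rate: {false_positives / runs:.3f} (should be close to {alpha})")
```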
💣 Statistical terminology
You can frame A/B testing as null-hypothesis testing and apply the following terms:
P-value - assuming the null hypothesis is true, the probability of seeing a result at least as extreme as the one observed. If the p-value falls into the "not expected" region (below the significance level), we reject the null hypothesis.
Statistical Significance (or Significance level, alpha) is the probability of seeing the effect when none exists (false positive).
Statistical Power (or 1-beta) is the probability of detecting the effect when it does exist.
Confidence Interval is the range of values likely to contain the true effect at a given confidence level: the narrower the CI, the more precise the result.
z-score is the number of standard deviations a value is from the mean.
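To tie the terms together, here is a minimal from-scratch sketch of a two-proportion z-test; the conversion rates and sample sizes are assumed example numbers.

```python
import math
from scipy.stats import norm

p_c, n_c = 0.20, 2000   # assumed Control conversion and sample size
p_v, n_v = 0.23, 2000   # assumed Variant conversion and sample size

pooled = (p_c * n_c + p_v * n_v) / (n_c + n_v)
se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))

z = (p_v - p_c) / se                    # z-score: standard errors away from "no difference"
p_value = 2 * (1 - norm.cdf(abs(z)))    # two-sided p-value
margin = norm.ppf(0.975) * math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print(f"95% CI for the lift: [{p_v - p_c - margin:.4f}, {p_v - p_c + margin:.4f}]")
```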
🤔 If you are lost in conversions and numbers, check this guide:
If your baseline conversion is 20% and you set a relative MDE of 10%, the test is sized to detect a Variant conversion outside the 18%-22% range.
The higher your baseline conversion, the smaller the sample size you’ll need (for the same relative MDE).
The smaller the MDE, the larger the sample you’ll need.
Low p-values are good. They indicate the observed difference is unlikely to have occurred by chance alone.
A common setup is a 95% confidence level (5% significance level) and 80% statistical power.
It’s often recommended to run the experiment for two business cycles (2-4 weeks).
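To make the two sample size rules above concrete, here is a minimal sketch using the standard two-proportion approximation; the baselines and MDEs in the loop are arbitrary example values.

```python
from scipy.stats import norm

def n_per_group(baseline, mde_rel, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-proportion test."""
    p1, p2 = baseline, baseline * (1 + mde_rel)
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return int(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)

for p in (0.05, 0.20, 0.40):          # higher baseline -> smaller sample
    for mde in (0.05, 0.10):          # smaller MDE -> larger sample
        print(f"baseline {p:.0%}, relative MDE {mde:.0%}: ~{n_per_group(p, mde):,} per group")
```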
📢 Use this calculator or this one to determine the needed sample size for your experiment.
📢 Use this calculator to evaluate your test significance and result.
🔍 Other types of product testing
Multivariate testing (MVT) - testing multiple changes and their combinations within a single test.
Split URL testing - multiple versions of your webpage hosted on different URLs.
Multipage testing - testing changes across different pages. There is both funnel Multi-Page testing and Conventional Multi-Page testing. Read more here.
Check out this guide if you want an A/B experiment checklist.