5 Mistakes To Avoid When Running A/B Tests - Issue 87
Common product test mistakes to watch for and a practical guide for successful A/B tests.
Hello analysts, and welcome to a free edition of the Data Analysis Journal, a weekly advice column about data and product analytics. If you’re not a paid subscriber, here’s what you missed this month:
SQL vs Python For Data Cleaning - I walk through the basics of common data cleaning steps and methods for both Python and SQL, comparing them with each other and offering examples of when and how to use SQL or Python.
Engagement and Retention, Part 2: Daily Active Users - a continuation of the User Engagement and Retention series with a deep focus on Daily or Monthly Active User metrics reporting. How to define active users, what events to use, which activity frequencies to follow, how to approach engagement analytics for weekly and monthly reporting, and more.
Q&Analytics - answering questions from readers: “I’ve worked at my company for over 4 years as an analyst. I am the longest-tenured and most knowledgeable employee on my team, and yet I haven’t been promoted, despite working long hours and often overtime. I know our domain and data better than anyone else on my team. I requested a promotion to Senior Analyst, but it was given to a newer coworker who has a better bond with my manager. I feel defeated and lost. What can I do to get promoted?”
Today I wanted to share one of my old(ish) articles that was reposted on LinkedIn, Reddit, Towards Data Science (Medium), and other blogs. Almost every team I was a part of during my 10+ year tenure as an analyst ran A/B tests continually. Some companies are very aggressive in running product tests, some make testing part of their culture, and some simply use it as a safety check to validate new features. Regardless of the company mission, values, or roadmap agenda, as a data or product analyst, you have to ensure your team follows the right test principles, ethics, and statistical foundations.
Here are some common mistakes and misunderstandings in approaching A/B tests. Make sure you catch these early in the process to prevent data bias, irrelevant evaluation, or incomplete results.
If you are getting started with A/B tests, make sure to read the A/B One-Pager Checklist - a short guide to steps and must-know terminology.
(Image credit: CXL)
Mistake 1: Not having a hypothesis ready
Product tests come in different variations: A/B, A/A, MVT, split-URL, multi-page. All of them are forms of hypothesis testing, which means you have to have a hypothesis: the change you want to prove or reject, and what you test it against.
For example:
Null hypothesis: the floating button does not cause at least 10% more clicks than the raised button.
Alternative hypothesis: the floating button does cause at least 10% more clicks than the raised button.
This can be applied to CTAs, subject lines, design layouts, user flows, or even new features.
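To make this concrete, here is a minimal Python sketch of evaluating such a hypothesis with a one-sided two-proportion z-test from statsmodels. The click and visitor counts are made up for illustration, and the sketch tests the simpler hypothesis that the floating button gets more clicks at all; the 10% threshold itself is what you plan for through the MDE and sample size (see Mistake 2).

```python
# A minimal sketch: one-sided two-proportion z-test for the button hypothesis.
# All counts below are made up purely for illustration.
from statsmodels.stats.proportion import proportions_ztest

clicks = [620, 540]      # [Variant (floating button), Control (raised button)]
visitors = [5000, 5000]  # users exposed to each version

# H0: the floating button does not get more clicks than the raised button
# H1: the floating button gets more clicks (one-sided test)
z_stat, p_value = proportions_ztest(clicks, visitors, alternative="larger")
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}")

if p_value < 0.05:
    print("Reject H0: the floating button shows a significant lift in clicks.")
else:
    print("Fail to reject H0: no significant difference detected.")
```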
Mistake 2: Not setting MDE before the test launch
MDE stands for Minimum Detectable Effect: the smallest percentage of change you want the test to be able to detect (for example, 10% more clicks). You must have the MDE set to estimate the test timeline and significance. The smaller the MDE, the larger the sample you need - and vice versa.
If your baseline conversion is 20% and you set a relative MDE of 10%, the test will be powered to detect conversion results between 18% and 22%.
Often product owners are flexible with the MDE. You can check the daily volume of traffic you have and roughly calculate whether your test can detect a 10% or 20% lift with the current volume of data (use the significance and sample size calculators below). If you see that you are likely to get less than a 2% lift, I’d question the whole test objective.
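If you prefer code over an online calculator, here is a minimal sketch of this check using statsmodels’ power calculations. The 20% baseline and 10% relative MDE match the example above; the alpha and power values are common defaults, not prescriptions.

```python
# A rough sketch of translating a baseline conversion and an MDE into a
# required sample size per group. Numbers are illustrative.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.20                 # current (Control) conversion rate
mde = 0.10                      # minimum detectable effect, relative
target = baseline * (1 + mde)   # 22% conversion expected in the Variant

effect_size = proportion_effectsize(target, baseline)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,          # significance level
    power=0.80,          # chance of detecting the lift if it really exists
    alternative="two-sided",
)
print(f"~{n_per_group:,.0f} users needed per group")
```

If that number dwarfs the traffic you can realistically send into the experiment, that is your early signal to revisit the MDE or the test itself.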
Mistake 3: Not knowing your success metric
This is the most common mistake with A/B tests. A product test aims to improve a conversion in order to increase a product metric that is ideally reflected in one of your KPIs - leading to either increased revenue or growth. Make sure that you establish a connection between your conversion and the product metric, and that you can walk through and prove this connection before the test launch.
For example:
10% more CTA clicks double the signup rate, which improves user growth or user acquisition by 1%.
10% more clicks from the new payment layout increase the upgrade rate, which can potentially improve new business MRR by 2%.
For such an estimation, you should know the baseline metrics (or Control conversion) before the test launch. Therefore, make sure you can measure the metrics you want to influence: have your MAU logic defined, churn calculation set, growth definition ready, etc.
Similar to MDE, the deeper in the product funnel the change you want to test sits, the larger the sample size you will need. In other words, if you measure the test against visits or signups (top of the funnel), you will need a relatively small sample to prove the difference, and the test is likely to run fast and be straightforward. If, however, you want to make a change to a specific conversion rate from a given upsell screen for churned / re-subscribed / active users (end of the funnel), you will need a large volume of daily data to reach significance, and the test is likely to run long with multiple dependencies. The sketch below illustrates how quickly the required sample grows.
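Here is a small, hypothetical illustration of that funnel effect: the same 10% relative MDE at a made-up 30% signup conversion versus a made-up 2% upsell conversion. The calculation mirrors the sample size sketch above.

```python
# Illustrative only: the same relative MDE requires far more users per group
# when the baseline conversion is small (deep in the funnel).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

def users_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Sample size per group to detect a relative lift over the baseline."""
    effect = proportion_effectsize(baseline * (1 + relative_mde), baseline)
    return NormalIndPower().solve_power(
        effect_size=effect, alpha=alpha, power=power, alternative="two-sided"
    )

print(f"Signup rate 30%, +10% lift: ~{users_per_group(0.30, 0.10):,.0f} users per group")
print(f"Upsell rate 2%,  +10% lift: ~{users_per_group(0.02, 0.10):,.0f} users per group")
```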
Mistake 4: Miscommunication on the test timeline
This is tied to Mistakes 2 and 3 described above. Knowing your MDE and baseline conversion metric, you can estimate the test timeline to reach significance and determine the test cost and effort. Depending on what data you have, how it is distributed, and what change you expect to see, there might be no appropriate timeline for your test to reach significance. It’s often recommended to run an experiment for 2 business cycles (2-4 weeks), but not every experiment fits this life cycle. Therefore, make sure to communicate long-running tests to stakeholders before they go live.
📢 Use this calculator to determine the needed sample size for your experiment.
📢 Use this calculator to evaluate your test significance and result.
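If you want a rough timeline estimate without a calculator, a back-of-the-envelope sketch like the one below works. The sample size, daily traffic, and 50/50 split are placeholder assumptions; plug in your own numbers.

```python
# A back-of-the-envelope timeline estimate. All numbers are placeholders:
# use your own required sample size (e.g., from the calculator above) and traffic.
import math

n_per_group = 25_000          # required users per group
daily_eligible_users = 3_000  # users entering the experiment per day
split = 0.5                   # share of eligible traffic assigned to each group

days_needed = math.ceil(n_per_group / (daily_eligible_users * split))
print(f"Roughly {days_needed} days to reach the target sample size")

if days_needed > 28:  # longer than ~2 business cycles
    print("Flag this as a long-running test before it goes live.")
```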
Mistake 5: Overlapping A/B tests
Every product team is responsible for its own tests. Often, multiple product teams run various A/B tests at the same time. If you have control over the test timeline (often a product analyst approves which test goes live and when), it’s easy to prioritize the appropriate tests first and communicate the test schedule to PMs. If you don’t have control over the schedule or are being pressured by strict deadlines, compromises must be made.
Here is a guide to follow for overlapping tests:
Make sure the same user doesn’t fall into multiple A/B tests (see the sketch after this list). If the testing instrumentation doesn’t support such differentiation, you have to exclude these users from your evaluation. This will push out the test timeline, as you are likely to end up with too little data to reach significance.
Make sure overlapping tests evaluate different product metrics. For example, you can focus on the volume of activity for measuring one test but look at the payment conversion for evaluating another. If multiple tests are tied to the same metric, you will have a hard time estimating which test contributes to the lift in your conversions, and to what degree.
Watch out for platform-overlapping tests (mobile and desktop) that are testing different user flows for the same feature. Often, there is no way to bucket these users into separate groups for A/B tests (for example, to prevent mobile users from one experiment getting into desktop tests), so you might have to define user segments and run tests only against the relevant audience. This often requires more sophisticated test applications.
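If your experimentation tool can’t prevent overlap, a quick check on the assignment log can at least surface it. Here is a minimal pandas sketch; the table layout, column names, and experiment names are all assumptions for illustration.

```python
# A quick sketch of finding users who were assigned to more than one
# concurrently running experiment. Data and names are made up.
import pandas as pd

assignments = pd.DataFrame({
    "user_id":    [1, 2, 3, 4, 2, 5, 3],
    "experiment": ["checkout_cta", "checkout_cta", "checkout_cta",
                   "pricing_page", "pricing_page", "pricing_page", "pricing_page"],
})

tests_per_user = assignments.groupby("user_id")["experiment"].nunique()
overlapping_users = tests_per_user[tests_per_user > 1].index.tolist()
print(f"Users in more than one experiment: {overlapping_users}")

# If the assignment tool can't prevent this overlap, exclude these users
# from both evaluations before measuring results.
```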
I also want to add that you should be prepared for the Variant not to show a significant improvement. In 90% of cases, it performs very close to Control, with a slight or no difference. Product teams often chase this small 0.05% improvement and spend a lot of time on the test setup, instrumentation, and evaluation. That’s why it is very important for an analyst to understand the test hypothesis, the whole product picture, and the test specifics in order to provide the right recommendation.
Thanks for reading, everyone. Until next Wednesday!