The New Age of A/B Tests - Issue 99
The ever-changing framework of A/B test analysis - how to adapt to it, scale, and foster analytics while keeping statistical confidence.
Welcome to another edition of Data Analysis Journal, an advice column about data and product analytics. If you’re not a paid subscriber, here’s what you missed this month:
Google Analytics Termination: Stay Calm And Keep Tracking - the end of Google Universal Analytics, the new age of Google Analytics 4, how it affects the digital analytics niche, and how to take advantage of it.
Python Pandas DateTime Reference Guide - DateTime transformations in Python Pandas - a compilation of tips, solutions, and workarounds for any possible case with DateTime formatting.
Engagement and Retention, Part 3: User Retention Deep Dive - a continuation of the User Engagement and Retention series with a focus on how to approach and calculate retention in SQL, how to set retention metrics and KPIs, the difference between retention types, and the impact of retention on user growth.
We had an eventful June: Snowflake Summit in Las Vegas, Amplitude Summit (also in Vegas), the virtual Twilio Segment CDP Week happening this week, and Data Stack Summit starting today. If you are in San Francisco and just hate sleep, join us at Spark at Dark next week for drinks, games, and data conversations! And if you are not, you can join the Data + AI Summit for free and learn about the modern data stack.
This week I wanted to talk about A/B tests again, and share my musings and concerns about modern trends pushed in product experimentation.
A few weeks ago, I shared the AppFigures chat Why People Don’t Buy Your Subscriptions, where app experts talk about doubling conversion rates and modern experimentation. Upon reflection, it hit me how aggressive and demanding the expectations for testing culture have become. The modern approach to A/B testing makes me question its methods and wonder whether the current analytical school of experimentation is actually set up for it.
A/B testing is often a source of tension between analysts and product owners. After reading success stories from Meta, LinkedIn, or Airbnb, many product leaders are inspired to follow the “modern trend” of constantly iterating on user flows, copy, CTA positioning, layouts, etc. Specifically, they expect:
A never-ending testing lifecycle with many tests running in parallel.
Ease of configuring the test audience, text, and copy.
Flexibility with adjusting traffic volume, launching, pausing, and reverting rollouts.
Instant test results and impact analysis on metrics and KPIs.
Then analysts come in and question the test objective, setup, baseline metrics, and experimentation toolkit. Quite often they have to push back on a test launch, delay the results readout, or even disregard the results. They may come across as enemies of progress when they’re really just trying to do their job.
While many A/B test issues can be solved by simply improving governance, procedure, and test protocols (read 5 Mistakes To Avoid When Running A/B Tests and follow my A/B Test Checklist), in some cases even if you follow the guidelines and do everything right, you are still at the mercy of the current testing instrumentation and limited analytics at your company.
Why we can’t have nice things
Your ability to test is directly tied to how well your analytics is set up. It’s different at every company, and it’s always changing and evolving.
After supporting hundreds of A/B tests at different companies, I’ve noticed these common challenges with A/B test rollouts, monitoring, and analysis:
Missing event tracking. It’s common for a test to be launched without tracking set up for the page view or the needed conversion, or for the tracking to return 50% or less of the expected volume.
Missing Baseline data. Quite often PMs launch a test to improve CVR without knowing its baseline - not because they didn’t care to research it, but because there often is no data for it. (This doesn’t apply to new feature rollouts and product adoption, where you are not expected to have a Baseline.)
The inability to test cross-platform. This is common for products that have both a website and an app. Identifying users who use multiple platforms is still not easily solved, even today.
Complexity with targeting a specific user audience for the test. Even though many tools today support sophisticated audience groupings with many properties and attributes, a common issue is the timing of when users gain or lose the attributes that qualify them for the test.
Test instrumentation limitations in keeping segmentation proportional between Control and Variants. Too often these groups are neither proportional nor equal in size.
Test instrumentation limitations in randomly distributing users across multiple properties and attributes. This is a big one, it happens more often than you’d think, and it can be difficult to confirm or prove from the analytics side. Because of how traffic is disseminated, I keep noticing that Variant groups may contain more recent and active users than Control or Baseline. This is very dangerous.
The inability to exclude users who get into multiple tests. It’s often doable in SQL if you are lucky enough to get experiment data into a database (see the sketch after this list). This is still a manual step, and it should ideally be handled in the test rollout tool itself.
The inability to exclude test users, bots, and fraudulent test participants (again, this can be done in SQL with some creativity).
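As a rough illustration of that exclusion step, here is a minimal pandas sketch, assuming experiment assignments have been exported to a table with hypothetical user_id and experiment_id columns:

```python
import pandas as pd

# Hypothetical export of experiment assignments: one row per user per test.
assignments = pd.DataFrame({
    "user_id":       [1, 1, 2, 3, 3, 4],
    "experiment_id": ["exp_a", "exp_b", "exp_a", "exp_a", "exp_c", "exp_b"],
})

# Count how many distinct concurrent tests each user is enrolled in.
tests_per_user = assignments.groupby("user_id")["experiment_id"].nunique()

# Users enrolled in more than one test should be excluded (or at least flagged)
# before the readout, since multiple treatments affect their behavior.
overlapping_users = tests_per_user[tests_per_user > 1].index
clean_assignments = assignments[~assignments["user_id"].isin(overlapping_users)]
print(clean_assignments)
```

The same logic translates directly to a GROUP BY / HAVING COUNT query if your experiment data lives in a warehouse.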
Many of these issues lead to either (1) a limited test toolkit or (2) premature analytics. Or both.
Buying testing software vs developing your own
For the former, at small and mid-stage companies, the question often arises: should we buy testing software or develop the instrumentation in-house? The answer will depend on the nature of the product, the volume of data and traffic, the platforms used, and the current data and analytics stack.
After going through both routes multiple times, here is what I wish my team and I knew:
Integrating with vendor software might take as long as, or even longer than, developing your own toolkit.
Many SaaS A/B test vendors will struggle with supporting cross-platform (website + apps) analytics and testing.
If you go the vendor route, don’t lock test analysis into the metrics and dashboards the vendor offers. Scope a data pipeline that lands the testing events in your database, and expect to use your current visualization tools for the test readout.
When developing your own toolkit, scope extensive QA and debugging work. You will keep running into more and more cases of users getting into the wrong groups.
(Note: this is not relevant for enterprises, where multiple teams spend years developing and maintaining experimentation infrastructure.)
To summarize, after working with many A/B test vendors, I’ve continued to notice a pattern:
With vendor tools, you are likely to save time on statistics and setting up rollouts. They usually support advanced significance and confidence estimations for all types of tests (A/B, split URL, multivariate). Playing with user groups and cohorts will be a breeze, setting up rollout sizes and attributes will be easy, and you will get a fancy dashboard where you can watch the status and rollout percentage of all your tests. But you will not save time on analytics and the test readout. You are likely to struggle to merge the data with your internal BI tools. Your KPIs and metrics will be “disconnected” from the vendor’s metrics, and quite often not in line with the test impact and results.
With an in-house tool, you are likely to save time on test analysis. You don’t need to set up new analytics to see the impact of the test on your metrics and KPIs: your in-house instrumentation will most likely be configured around your product and business definitions and events. But you may struggle with statistics and significance and end up manually calculating confidence, sample sizes, and estimated timelines. You are also likely to face limitations in the rollout procedure, like doing a slow rollout (5% → 10% → 25% → 50%) or managing multiple rollouts and tests simultaneously.
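If you do end up calculating significance and sample sizes by hand, here is a minimal sketch of a two-proportion z-test and a rough per-group sample-size estimate; the conversion counts below are made up for illustration:

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test comparing Control vs Variant conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

def sample_size_per_group(p_baseline, mde, alpha=0.05, power=0.8):
    """Approximate users needed per group to detect an absolute lift of `mde`."""
    p_variant = p_baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    variance = p_baseline * (1 - p_baseline) + p_variant * (1 - p_variant)
    return int((z_alpha + z_beta) ** 2 * variance / mde ** 2) + 1

# Illustrative numbers only.
z, p = two_proportion_ztest(conv_a=480, n_a=10_000, conv_b=530, n_b=10_000)
print(f"z = {z:.2f}, p-value = {p:.3f}")
print("Users per group:", sample_size_per_group(p_baseline=0.048, mde=0.005))
```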
Premature analytics
Your A/B testing instrumentation may fit your testing needs, but the analytical state and data maturity at your company might still lag behind. For example, you might not have access to structured data, the ability to JOIN experiment events with other data domains to create funnels and waterfalls, or the ability to connect to BI applications to enhance and speed up analysis.
If that’s the case, as an analyst, you will have to move slowly and carefully, and “test all the things” is unlikely to happen.
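As one example of what that JOIN work can look like once the data is available, here is a minimal pandas sketch that merges experiment assignments with downstream conversion events to get a per-group conversion readout; all table and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical exports: experiment assignments and downstream conversion events.
assignments = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "group":   ["control", "variant", "control", "variant", "control", "variant"],
})
conversions = pd.DataFrame({
    "user_id": [2, 3, 6],
    "event":   ["purchase", "purchase", "purchase"],
})

# JOIN the experiment domain with the product events domain.
joined = assignments.merge(conversions, on="user_id", how="left")

# Per-group conversion: share of assigned users with at least one conversion event.
users = assignments.groupby("group")["user_id"].nunique()
converters = joined[joined["event"].notna()].groupby("group")["user_id"].nunique()
cvr = (converters / users).fillna(0)
print(cvr)
```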
Here are 10 action items you can practice to improve test readouts, validation, significance, and confidence in your results:
Know your Baseline conversion for every test. Ideally, measure it multiple times, know its range, and compare averages (see the baseline sketch after this list). This will help you identify if your Variant conversion is way off or suspiciously high or low.
Avoid running experiments on complex user groups that have many attributes. The more attributes you introduce, the more complex the experiment becomes.
Run an A/A test first. It helps you check the software, outside factors, and natural variance.
Check sample variance and distribution (if you can).
Avoid slow and disproportionate rollouts. For example, once the test has been released to 10% of traffic, do not reduce its size. Every time such a change is made, significance and results have to be recalibrated.
Make sure overlapping tests evaluate different product metrics. For example, you can focus on the volume of activity for measuring one test but look at payment conversion for evaluating another. If multiple tests are tied to the same metric, you will have a hard time estimating which test contributes to the lift in your conversions, and to what degree.
When introducing a new change, run the test on a smaller sample for a longer period of time to eliminate the novelty or learning effect bias.
Run sanity checks. Control metrics and conversions should match the Baseline, and group sizes should match the intended split (see the sample-ratio sketch after this list). If they don’t, question the test setup. The results might be misleading.
Run spot checks. Pick a few users from the Control and Variant samples and check them to ensure they are random, do not overlap with other tests, and meet the test requirements.
If the result is not what you expected, think of the potential bias - novelty effect, learning effect, network effect.
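For the baseline point above, here is a minimal sketch of measuring a Baseline conversion rate over several weekly windows and looking at its range; the events export and event names are hypothetical:

```python
import pandas as pd

# Hypothetical events export with user_id, event, and timestamp columns.
events = pd.read_csv("events.csv", parse_dates=["timestamp"])

# Weekly Baseline CVR over the pre-test period: unique converters / unique visitors.
weekly = (
    events.set_index("timestamp")
          .groupby([pd.Grouper(freq="W"), "event"])["user_id"]
          .nunique()
          .unstack(fill_value=0)
)
weekly["cvr"] = weekly["conversion"] / weekly["page_view"]  # assumed event names

# Knowing the mean and the range tells you whether a Variant CVR is genuinely off
# or just within the Baseline's natural week-to-week variance.
print(weekly["cvr"].agg(["mean", "min", "max", "std"]))
```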
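And for the sanity check on group sizes, here is a minimal sample-ratio check using a chi-square goodness-of-fit test against an intended 50/50 split; the counts are illustrative:

```python
from scipy.stats import chisquare

# Observed group sizes vs the sizes the intended 50/50 split would produce.
observed = [10_210, 9_640]              # Control, Variant (illustrative counts)
total = sum(observed)
expected = [total * 0.5, total * 0.5]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)

# A tiny p-value means the split is unlikely to be the configured 50/50,
# which points to an assignment or tracking problem, not a real user effect.
print(f"chi-square = {stat:.1f}, p-value = {p_value:.4f}")
if p_value < 0.01:
    print("Sample ratio mismatch - investigate the test setup before reading results.")
```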
As you can see, agile experimentation is the product of many factors. Supporting one test is a lot of work. Developing a framework for monitoring and analyzing 20-50 tests running simultaneously, with high precision and confidence in the results, might not be doable for every company. And even if it is, after going through so many edge cases and tricky scenarios, I wouldn’t trust it very much.
With all of this, I’d like to wrap up with my personal and maybe unpopular opinion that A/B tests are overrated. In my practice, I have yet to see a test that drove significant user growth or had a noticeable revenue impact. Most tests are flat or slightly increase downstream conversions, yet they take a lot of effort to design, launch, and analyze. Product optimizations drive “easy-win” users who are likely to churn or abandon your product within a few months.
Introducing new features, experiences, and incentive programs has proven to be the most effective growth factor. I won’t say you shouldn’t test and optimize at all. But maybe not everything at once - take a slower pace and trust your analyst.
Thanks for reading, everyone. Until next Wednesday!
Hello Olga,
How do you calculate the baseline?