A Guide To A/B Test Evaluation - Issue 46
How to measure A/B test success, and why product experiments can be tricky and not always effective.
Let’s say you have successfully shipped an experiment with a winning +15% increase in your conversion. But somehow, this isn’t reflected in your revenue data yet. Even more concerning, it didn’t improve overall retention or increase LTV, as you expected (and as the test seemed to prove). Before you question the experiment data or blame your analyst, read this article, where I focus on the foundation of product experimentation, rollout procedures, methods to evaluate and measure performance, and how to best estimate impact.
When I am asked to evaluate or guide a product experiment, I start by trying to understand what type of product change it is. I prefer to differentiate between introducing a new feature and optimizing an existing one, and I classify all product tests into 3 categories:
Optimizing an existing product or feature.
Introducing a change to an existing product or feature.
Introducing a new product or feature that didn’t exist before.
While these sound similar (and in many, many companies they are treated the same), they in fact have different lifecycles and rollouts and should be evaluated differently. Let’s break them down to understand why.
Optimizing an existing product or feature.
From Optimizely: What Is A/B Testing?
This is the most common A/B test type. You simply change the color, format, or positioning of a known (existing) feature; the change doesn’t alter the user’s path but is intended to optimize the experience. Many companies have multiple such tests running at any given time. These are the types of experiments that can run simultaneously (if set up correctly, they don’t interfere with each other and are exposed to different user groups). Usually, these tests are fast and low-impact. It’s rare that you notice a significant conversion change, and most likely there will be low variance in your results.
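For the curious, here is a minimal sketch of how that independence between concurrent tests is often achieved in practice: assignments are derived from a hash of the user ID salted with the experiment name. The function names and experiment names below are hypothetical, not a specific tool’s API.

```python
import hashlib

def bucket(user_id: str, experiment: str, n_buckets: int = 100) -> int:
    """Deterministically map a user to a 0-99 bucket for a given experiment."""
    # Salting the hash with the experiment name makes bucket assignments for
    # different experiments statistically independent of each other.
    key = f"{experiment}:{user_id}".encode()
    return int(hashlib.sha256(key).hexdigest(), 16) % n_buckets

def assign_arm(user_id: str, experiment: str, variant_pct: int = 50) -> str:
    """Assign Control/Variant; only `variant_pct`% of buckets get the Variant."""
    return "variant" if bucket(user_id, experiment) < variant_pct else "control"

# The same user can land in different arms of different (hypothetical) tests:
print(assign_arm("user_123", "cta_color_test"))
print(assign_arm("user_123", "checkout_layout_test"))
```

Because the assignment is a pure function of user ID and experiment name, it is stable across sessions and doesn’t require storing per-user assignments.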
A full disclosure: if you know me, you probably know that I’m not a fan of these types of tests and sometimes recommend pushing back on them. If users intend to purchase or sign up, it doesn’t matter if your CTA is green or blue, sits at the top of the page or the bottom left, or even has exploding confetti around it. If there is intent, users will find the button and manage to engage or complete the flow. I know, I know, there are many studies out there proving that with the right color or layout you can trick, attract, or win your customers. But based on my experience evaluating hundreds of these tests, these “won” users are very likely to churn or stagnate, won’t contribute significantly to your revenue, and your ROI will be low. And yes, I often have long debates with UI and UX teams who want to prove me wrong. But data and Power Users speak for themselves. Not convinced? Check out Conservation of Intent by Andrew Chen, which discusses exactly this point.
Introducing a change to an existing product or feature.
Here things start getting interesting for an analyst, and this is where you can potentially move the needle. In this case, you introduce a change to a known feature that affects the user journey or path. It may be a test that removes or adds steps in the signup form, re-routes users through a different page flow to get to your value pitch, adds or removes CTAs, or anything else that introduces a new user behavior or path.
These tests can get quite complex, and can have many hidden traps:
Changing too many things at once: If you want to simplify the flow from 9 steps to 5 steps for a signup or a purchase form, you have to test, one by one, every variation of the new flow (drop one step at a time), and that can be time-consuming and expensive (see the sample-size sketch after this list). But if you don’t, and you apply multiple changes at once and your Variant loses to Control, you won’t know what exactly caused it - was it dropping step 5 or step 6, or the new layout? How can you be sure? Then you will have to come back, go through the rollout again, and design new variants, which ends up being an even longer timeline and more effort. My rule of thumb is: test first for the most optimal user flow, then optimize it for the layout.
Measuring against the wrong Baseline metric: adding/removing CTAs makes it tricky to pick the right Baseline metric. For example, your signup → purchase conversion might not be the relevant metric for testing a Variant with a signup → invite friends → purchase funnel. Even though you might get fewer purchases from the Variant, your users may still generate more recruits or leads through the new share action, which might convert into more purchases over time. You won’t see this effect in your Control/Variant evaluation. But the test may actually be succeeding.
Missing secondary CTAs: if you remove pages or reroute users into a new flow, it’s easy to miss a step with secondary CTAs on the dropped page. While your Variant can win, your engagement/revenue can lose.
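To make the first trap concrete, here is a rough back-of-the-envelope sketch of how much traffic each “drop one step” variant costs, using the standard two-proportion sample-size formula. The baseline conversion and the minimum detectable effect below are made-up numbers for illustration.

```python
from scipy.stats import norm

def sample_size_per_arm(baseline: float, relative_mde: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users needed per arm to detect a relative lift `relative_mde`
    over a `baseline` conversion rate with a two-sided test."""
    p1, p2 = baseline, baseline * (1 + relative_mde)
    z_alpha, z_beta = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int(round(variance * (z_alpha + z_beta) ** 2 / (p2 - p1) ** 2))

# Illustrative numbers: 12% signup conversion, hoping to detect a 5% relative lift.
n = sample_size_per_arm(baseline=0.12, relative_mde=0.05)
print(f"~{n:,} users per arm, for EACH single-step variant you test")
# Testing four "drop one step" variants one by one multiplies both the traffic
# and the calendar time accordingly.
```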
There are many such tricky cases that make the analyst’s world a charming challenge.
Once I was revisiting an experiment that an analyst on my team had marked as unsuccessful; with a different rollout strategy that targeted another audience, it performed extremely well and doubled our MAU.
I remember one experiment where we tried to optimize the onboarding funnel and dropped a few pages to simplify it. What could go wrong, you might say? One of the dropped pages contained a share button. The test showed improvement, as more users in the Variant completed the flow, and the PM was about to call it a success. But around the same time, I noticed a drop in views that led to fewer signups and, thus, fewer conversions - none of which was reflected in our experiment data. It took me a while to figure out the connection between the test and the decrease in views, as it wasn’t obvious why optimizing the onboarding flow for registered users would reduce the recruit rate, and then we had to revisit and change our designs.
That’s why you need to have an analyst look into any experimentation you are planning or running. Simple questions can become quite complex when you start breaking these down on a timeline, user segments, or engagement layers. Today, some of the testing tools offer thorough evaluation support that is automated and can reduce analysts’ work. That being said, an improvement in one metric doesn’t always mean improvement for the business. Very often, the biggest change from the experiment is happening behind the scenes.
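As an illustration of what “breaking it down” looks like in practice, here is a minimal pandas sketch that slices the same experiment results by user segment to check whether the aggregate lift holds everywhere. The `events` frame and its columns are hypothetical, not a specific tool’s export format.

```python
import pandas as pd

# Hypothetical per-user experiment export (columns are assumptions, not a real schema).
events = pd.DataFrame({
    "user_id":   [1, 2, 3, 4, 5, 6, 7, 8],
    "arm":       ["control", "variant"] * 4,
    "segment":   ["new", "new", "new", "new", "power", "power", "power", "power"],
    "converted": [0, 1, 0, 1, 1, 0, 1, 1],
})

# The aggregate lift can hide very different (even opposite) effects per segment.
overall = events.groupby("arm")["converted"].mean()
by_segment = events.groupby(["segment", "arm"])["converted"].mean().unstack("arm")

print(overall)
print(by_segment)
```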
Introducing a new product or feature that didn’t exist before
From: How To Announce New Features
This is the hardest type of experimentation - and the one most commonly misconducted and wrongly evaluated. It is also the type of product change that can significantly increase company revenue, reach, and growth. I keep wondering why so many companies treat a new feature rollout like any regular product change and run it as a classic A/B test. This is wrong, and here is why.
When you conduct an A/B test, you set up a Baseline metric against which you evaluate Control and Variant performance. A new feature is meant to introduce a product usage that hasn’t been available before, so you don’t yet have any initial conversions to use for the test. You might have to pick proxy metrics, which should be done cautiously. Treating it as an A/B test, you end up comparing user groups whose journeys and behaviors are not related to each other, which might not give you any insight into which one is actually better.
I remember being confused about why many new-feature tests convert a little (or a lot) below expectation. For example, when you add a new Product B (such as breathing and meditation exercises) next in the list to the existing, long-established Product A (like stretching and power exercises), conversions from Product B will most likely be lower than from Product A, even though it may offer more exercises or discounts to clients. It didn’t make any sense to me. The reason is that you are pitting new, raw positioning against optimized and well-tested products. Citing Andrew Chen, “A classic A/B test will often eliminate the new design because it performs worse” (How to use A/B testing for better product design). The new MVP often isn’t optimized yet to perform better or compete with other tested/improved features. This is the case where a qualitative approach (user interviews, surveys, usability testing, etc.) should be added to evaluate a new product release and its success.
That’s why you shouldn’t make your decision based on a comparison of conversions between the two; instead, you might have to keep exposing more traffic to the new feature and testing new iterations. Eventually, it will catch up.
For these tests, my recommendation is to think of a new feature rollout as Product Adoption. And instead of measuring it against the Baseline (as you do for A/B tests), zoom out and monitor (a) KPIs and (b) a set of new adoption metrics.
KPIs
For subscription-based products or services, measuring product adoption is essential. A high adoption rate leads to increased:
MRR
LTV
Trial To Subscription conversion
Free to Paid conversion
Adoption metrics:
Here are the top 2 measurements to start with:
Adoption rate:
Adoption rate = (users who used the new feature / all existing users) × 100%
This metric is meaningless without a specific time window. For example, to get the Adoption rate for December, take the users who used the feature for the first time on any day in December and divide by the total number of users you have on the last day of the month.
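Here is a minimal pandas sketch of that December calculation, assuming you have a table of feature-usage events and a table of all users with signup dates. The frame and column names below are assumptions about how your data might look, not a fixed schema.

```python
import pandas as pd

# Assumed inputs: all feature-usage events, plus the full user base with signup dates.
feature_events = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "used_at": pd.to_datetime(["2023-12-03", "2023-12-20", "2023-12-15", "2023-11-28"]),
})
users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "signed_up_at": pd.to_datetime(
        ["2023-01-10", "2023-06-01", "2023-09-15", "2023-11-20", "2023-12-05"]),
})

month_start, month_end = pd.Timestamp("2023-12-01"), pd.Timestamp("2023-12-31")

# Users whose FIRST use of the feature falls in December...
first_use = feature_events.groupby("user_id")["used_at"].min()
december_adopters = first_use.between(month_start, month_end).sum()

# ...divided by the total number of users existing on the last day of the month.
total_users = (users["signed_up_at"] <= month_end).sum()

print(f"December adoption rate: {december_adopters / total_users:.1%}")
```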
Time-to-Action:
This is the average time for a new customer to use an existing feature or the average time for an existing customer to use a new feature for the first time.
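A minimal sketch of the second case (existing customers adopting a new feature), assuming you know the feature launch date and each adopter’s first-use timestamp; the dates below are illustrative.

```python
import pandas as pd

# Assumed inputs: the date the feature became available to existing customers,
# and each adopter's first-use timestamp (illustrative values).
launched_at = pd.Timestamp("2023-12-01")
first_used_at = pd.Series(
    pd.to_datetime(["2023-12-03", "2023-12-15", "2023-12-20"]),
    index=[1, 2, 3],  # user_id
)

# Time-to-Action for existing customers: launch -> first use of the new feature.
time_to_action_days = (first_used_at - launched_at).dt.days
print(f"Average Time-to-Action: {time_to_action_days.mean():.1f} days")
print(f"Median Time-to-Action:  {time_to_action_days.median():.1f} days")
# The median is often worth reporting alongside the average, since a few late
# adopters can stretch the mean considerably.
```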
❗Things to remember:
The more engaged users are with a product, the faster they will discover new features and the more often they will use them. This is a big caveat if you are running an A/B test on an audience that is not randomly distributed: the measurement of feature discovery and usage will be skewed and potentially over-reported (see the sanity-check sketch after this list).
If your feature or product is complex or requires customer learning, you might have to introduce a Time-to-Adopt metric in addition to the Time-to-Action metric, to measure how much time it takes for users (either new or existing) to start actively using the new feature.
Another secondary metric to look at is Duration - how long users keep using the feature after discovering it.
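On the first caveat above: before trusting discovery and usage numbers from a test, it’s worth a quick sanity check that traffic was actually split as designed (a sample ratio mismatch check). A minimal sketch with scipy; the arm counts below are made up for illustration.

```python
from scipy.stats import chisquare

# Observed users per arm vs. the intended 50/50 split (illustrative counts).
observed = [50_420, 49_180]
total = sum(observed)
expected = [total * 0.5, total * 0.5]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2={stat:.1f}, p={p_value:.4f}")
# A very small p-value suggests the traffic split isn't what was designed
# (sample ratio mismatch), so discovery and usage metrics from this test
# are likely skewed and shouldn't be trusted as-is.
```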
To summarize: product experimentation can get quite complex. Regardless of what type of test the product team is launching, your responsibility as an analyst is to guide them and ensure that the product experimentation rests on a proper foundation.
📈 Related past articles:
5 Mistakes To Avoid When Running A/B Tests
Thanks for reading, everyone. Until next Wednesday!
Hi Nadya, thank you for reading my newsletter.
What teams usually do: they launch a feature to a small % of traffic, let's say 25% of total traffic, compare the metrics lift against the other 75% of users who don't have the new feature, and call it an "A/B test". They think they are comparing Control (no feature) vs Variant (new feature). But in fact, this is not an A/B test but simply a split test. There is a big difference.
You can't A/B test a new feature, because you obviously do not have any data yet to prepare for the A/B test launch: you can't estimate MDE, test timeline, expected lift, or baselines. For an A/B test, even a Bayesian one, you need to set your statistics up correctly. You do not always need this for a split test.
My recommendation is to launch a new feature to a small % of traffic and monitor adoption metrics and sensitive usage metrics (conversions, clicks, views) for a few days. Once you are confident the new feature is not harmful (e.g., it doesn't take traffic away from other features and positively affects activity), you can expand the rollout to all users.
Once you have a month of data or more, run a Pre/Post analysis against your ecosystem metrics or top KPIs.
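For what it's worth, a minimal sketch of such a Pre/Post check on a daily KPI, assuming you export the KPI as a simple daily series; the numbers below are synthetic, and in practice you would also want to account for seasonality and concurrent launches.

```python
import numpy as np
from scipy import stats

# Assumed input: a daily KPI (e.g., signups) for 30 days before and 30 days
# after the full rollout. The series below are synthetic for illustration.
rng = np.random.default_rng(7)
pre = rng.normal(loc=1_000, scale=50, size=30)
post = rng.normal(loc=1_060, scale=50, size=30)

t_stat, p_value = stats.ttest_ind(post, pre, equal_var=False)  # Welch's t-test
lift = post.mean() / pre.mean() - 1

print(f"Pre mean: {pre.mean():.0f}, Post mean: {post.mean():.0f}")
print(f"Estimated lift: {lift:+.1%} (p={p_value:.3f})")
# Caveat: a naive Pre/Post comparison can't separate the feature's effect from
# seasonality, marketing pushes, or anything else that changed in the same window.
```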
Hi Olga - Thank you for writing this. When launching a new feature (the 3rd category above), you talked about monitoring KPIs (like LTV) and adoption metrics (like the % of users using the new feature), with those adoption metrics driving long-term KPIs.
Do we still conduct an A/B test with one of the adoption metrics as the primary metric and compare test vs control results, let's say after 1 month of monitoring? Or are you saying it does not make sense to do an A/B test in this scenario at all, and instead we need to look at these KPIs and adoption metrics over a period of time with the new feature? In that case, what do we compare them against?