<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Data Analysis Journal: A/B testing]]></title><description><![CDATA[Test all the things]]></description><link>https://dataanalysis.substack.com/s/ab-testing</link><image><url>https://substackcdn.com/image/fetch/$s_!WdsI!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7029b3-f274-4215-ac43-d275f496ecf8_200x200.png</url><title>Data Analysis Journal: A/B testing</title><link>https://dataanalysis.substack.com/s/ab-testing</link></image><generator>Substack</generator><lastBuildDate>Fri, 17 Apr 2026 16:18:59 GMT</lastBuildDate><atom:link href="https://dataanalysis.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Olga Berezovsky]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[dataanalysis@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[dataanalysis@substack.com]]></itunes:email><itunes:name><![CDATA[Olga Berezovsky]]></itunes:name></itunes:owner><itunes:author><![CDATA[Olga Berezovsky]]></itunes:author><googleplay:owner><![CDATA[dataanalysis@substack.com]]></googleplay:owner><googleplay:email><![CDATA[dataanalysis@substack.com]]></googleplay:email><googleplay:author><![CDATA[Olga Berezovsky]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[10 Experiments Every Data Team Should Run - Issue 288]]></title><description><![CDATA[Top must-try experiments every data scientist should know - lessons from StatSig Summit.]]></description><link>https://dataanalysis.substack.com/p/10-tests-every-data-team-should-test</link><guid 
isPermaLink="false">https://dataanalysis.substack.com/p/10-tests-every-data-team-should-test</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 29 Oct 2025 12:03:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!elXI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdf9a130f-570a-4dd5-bd5b-ccdde80c622f_1022x716.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the Data Analysis Journal, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Today I want to share a recap of a recent talk by <a href="https://www.linkedin.com/in/lizobermaier/">Liz Obermaier</a>, a former Meta data scientist now at StatSig, at the StatSig Summit in San Francisco.</p><p>I&#8217;ll admit, I was skeptical at first. The title felt a bit over the top, but after watching the recap, I was impressed. The content is practical and well-structured. Nothing I haven&#8217;t <a href="https://dataanalysis.substack.com/s/ab-testing">covered in my newsletter before</a>, but Liz explained it so clearly that I had to save it - and share it with you.</p><p>Below is my summary of and commentary on her talk &#8212; the 10 experiments every data team should run. 
Some things I don&#8217;t agree with, but overall, it&#8217;s a very decent step into <strong>data science 2.0</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bCjk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa21ab60f-495a-47c9-a1fc-7bf0a5b2fcdb_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bCjk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa21ab60f-495a-47c9-a1fc-7bf0a5b2fcdb_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!bCjk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa21ab60f-495a-47c9-a1fc-7bf0a5b2fcdb_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!bCjk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa21ab60f-495a-47c9-a1fc-7bf0a5b2fcdb_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!bCjk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa21ab60f-495a-47c9-a1fc-7bf0a5b2fcdb_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bCjk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa21ab60f-495a-47c9-a1fc-7bf0a5b2fcdb_200x200.png" width="136" height="136" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a21ab60f-495a-47c9-a1fc-7bf0a5b2fcdb_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:136,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bCjk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa21ab60f-495a-47c9-a1fc-7bf0a5b2fcdb_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!bCjk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa21ab60f-495a-47c9-a1fc-7bf0a5b2fcdb_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!bCjk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa21ab60f-495a-47c9-a1fc-7bf0a5b2fcdb_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!bCjk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa21ab60f-495a-47c9-a1fc-7bf0a5b2fcdb_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h2>All Models Are Wrong (But Some Are Useful)</h2><p>Quoting George Box, <em>&#8220;<a href="https://en.wikipedia.org/wiki/All_models_are_wrong">All models are wrong, but some are useful</a>&#8221;</em>, is actually a great way to introduce experimentation. It&#8217;s controversial because people focus on the &#8220;wrong&#8221; part. 
But what it really means is that every model involves uncertainty and assumptions. There&#8217;s always a tradeoff between the assumptions you make and how confident you can be in the results.</p><p>That tradeoff sits at the core of experimentation. As data scientists, we tend to overcomplicate this. We want to explain every assumption in detail, while business stakeholders just want to know: <em>What should we do?</em></p><h3>Why randomized controlled trials matter</h3><p>There are many types of experiments (behavioral studies, quasi-experiments, difference-in-differences, lots of user research qualitative techniques), but this presentation was mostly about <strong>randomized controlled trials (RCTs)</strong>, where randomization does the heavy lifting - exactly what happens during an A/B test.</p><p>Randomization is very important - it balances test confounders, both known and unknown, so you don&#8217;t have to model every possible bias. That frees up your mental bandwidth to focus on the real work: edge cases, communication, and stakeholder alignment, rather than defending your assumptions. In other words, if you&#8217;re unable to randomize your traffic, you still can run experiments, but analyzing and interpreting them right will take 10x more effort and time - you would need to model the lift separately for every possible attribute.</p><p>Keep in mind that randomization can happen at different units - sometimes it&#8217;s not users, but sessions, queries, or even servers. The right choice depends on the assumptions you can safely make.</p><h2><strong>Part 1: Basic Experiments</strong></h2><p>These are simple but foundational. Most organizations run them early and often.</p><h3><strong>1. Standard A/B Growth Test</strong></h3><p>This is the classic marketing experiment. You test variations of a message, call-to-action, or landing page.</p>
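<p>To make the randomization-unit idea concrete, here is a minimal sketch (my own, not from the talk) of deterministic bucketing: hashing the unit id together with the experiment name gives every unit a stable, roughly even variant assignment, whatever the unit is - user, session, query, or server. All names here are illustrative.</p>

```python
import hashlib

def assign_variant(unit_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Deterministically assign a randomization unit (user, session,
    query, server, ...) to a variant by hashing its id together with
    the experiment name."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# The same unit always lands in the same variant for a given experiment,
# and across many units the split is approximately even.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user_{i}", "cta_test")] += 1
```

<p>Because the hash is keyed by experiment name, re-running the assignment is idempotent within one experiment, while a different experiment reshuffles units independently.</p>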
      <p>
          <a href="https://dataanalysis.substack.com/p/10-tests-every-data-team-should-test">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Notion Scaled Experimentation - Issue 286]]></title><description><![CDATA[How the team grew and scaled A/B testing without losing rigor - lessons from StatSig Summit.]]></description><link>https://dataanalysis.substack.com/p/how-notion-scaled-experimentation</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/how-notion-scaled-experimentation</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 15 Oct 2025 10:01:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!9S4m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F56b41ce9-9353-4011-9c61-3d2faf6a69ce_1340x774.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Today I want to share a recap of my favorite talk from the <a href="https://www.notion.com/">Notion</a> data science team at the recent <a href="https://www.statsig.com/sigsum/">StatSig Summit</a> in San Francisco.</p><p>It&#8217;s a great one for anyone leading data or analytics teams, or trying to scale experimentation from a few tests to dozens running at once. The Notion team shared what it really takes to build a strong experimentation setup: the right infrastructure, project process, analytics framework, and, most importantly, the right mindset.</p><p>I liked this talk the most because they didn&#8217;t chase flashy milestones like &#8220;<em>50 active tests at a time</em>&#8221;. Instead, they focused on quality over speed - how to build trust in results and make every test count. After all, if you can&#8217;t interpret one test correctly, running 50 more won&#8217;t help much.</p>
      <p>
          <a href="https://dataanalysis.substack.com/p/how-notion-scaled-experimentation">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Rethinking A/B Testing for B2B and SaaS - Issue 274]]></title><description><![CDATA[Lessons from StatSig on best practices of designing and running experiments in B2B]]></description><link>https://dataanalysis.substack.com/p/rethinking-ab-testing-for-b2b-and</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/rethinking-ab-testing-for-b2b-and</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 13 Aug 2025 11:03:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6-p1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6861e364-f214-4a10-8180-edaf708e7db0_1356x758.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the Data Analysis Journal, a weekly newsletter about data science and analytics.</em></p><div><hr></div><p>When we talk about experimentation, we usually mean it in the context of B2C products. It&#8217;s rare to find examples or case studies on A/B testing for B2B or SaaS, where onboarding is sales-driven, user samples are small, there&#8217;s little variance, and every customer has a dedicated account manager providing white-glove support. What&#8217;s there to test?</p><p>I used to think that if there was no self-service option for customers, there wasn&#8217;t much to experiment with.</p><p>I was so wrong! There&#8217;s a lot of experimentation happening in B2B, it&#8217;s just different.</p><p>I haven&#8217;t run A/B tests for B2B or SaaS yet, so when I saw <a href="https://statsig.com/">Statsig</a>&#8217;s talk at this year&#8217;s <a href="https://www.datacouncil.ai/bay-2025">Data Council</a> on experimentation in B2B, I was intrigued. 
It ended up being my second-favorite presentation (after <a href="https://dataanalysis.substack.com/p/causal-inference-methods-for-bridging">Roblox&#8217;s session on causal inference</a>).</p><p>Today, I want to share my takeaways and learnings from that talk. While the principles of A/B testing never change, running experiments in the B2B world comes with unique challenges and some surprising advantages. Let&#8217;s break them down.</p>
      <p>
          <a href="https://dataanalysis.substack.com/p/rethinking-ab-testing-for-b2b-and">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Causal Inference Methods for Bridging Experiments and Strategic Impact - Issue 267]]></title><description><![CDATA[How to connect A/B test results to real-world business decisions - lessons from Roblox]]></description><link>https://dataanalysis.substack.com/p/causal-inference-methods-for-bridging</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/causal-inference-methods-for-bridging</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 16 Jul 2025 12:02:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rx-E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddf1b257-5ba4-4e50-84a1-7329971311e3_1440x798.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the Data Analysis Journal, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Today I want to introduce you to someone great - a fellow data scientist, <a href="https://www.linkedin.com/in/wenjing-zheng/">Wenjing Zheng</a>, Senior Data Science Manager at <a href="https://www.roblox.com/">Roblox</a>. I met Wenjing in May at <a href="https://www.datacouncil.ai/bay-2025">Data Council</a>, and today I want to share her insights on connecting A/B test results to real-world business decisions at Roblox.</p><p>This is what data scientists spend most of their time doing - teasing apart the true impact of A/B tests from holidays and seasonality, from external factors driving change, and from other campaigns running in parallel. 
It&#8217;s hard, like explaining why a true +10% lift in transactions shows up as only a 0.01% ARR increase. I&#8217;ve shared some of <a href="https://dataanalysis.substack.com/p/how-to-find-optimal-proxy-metrics">my methods of estimating such impact</a> before, and today, I want to show how Roblox is doing it.</p><p>Read below a recap of Wenjing&#8217;s talk on causal inference methods that help bridge the gap between clean experimental results and messy strategic decisions, how to attribute business growth to product launches, or generalize experiment outcomes to broader populations.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_YMB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb6d05d7-feaa-40c5-b87e-fb980535f874_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_YMB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb6d05d7-feaa-40c5-b87e-fb980535f874_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!_YMB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb6d05d7-feaa-40c5-b87e-fb980535f874_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!_YMB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb6d05d7-feaa-40c5-b87e-fb980535f874_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!_YMB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb6d05d7-feaa-40c5-b87e-fb980535f874_200x200.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_YMB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb6d05d7-feaa-40c5-b87e-fb980535f874_200x200.png" width="174" height="174" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb6d05d7-feaa-40c5-b87e-fb980535f874_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:174,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_YMB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb6d05d7-feaa-40c5-b87e-fb980535f874_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!_YMB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb6d05d7-feaa-40c5-b87e-fb980535f874_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!_YMB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb6d05d7-feaa-40c5-b87e-fb980535f874_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!_YMB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb6d05d7-feaa-40c5-b87e-fb980535f874_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p><a href="https://www.roblox.com/">Roblox</a> is an online platform where users can both play and create their own games, 
called "experiences". Wenjing leads a data science team responsible for the experimentation platform, which hosts hundreds of tests and manages time-series tooling for forecasting, business monitoring, anomaly detection, and root cause analysis. The team uses multiple causal inference methods to enable and support data science partners.</p><p>Below, I share slides and takeaways on how Wenjing&#8217;s team approaches estimating the impact of A/B tests.</p><h2><strong>How to segment out local vs. global impact?</strong></h2><p>Individual teams at Roblox run experiments on surfaces like the homepage, notifications, marketplace, etc. Each experiment has a <em>local lift</em>, like a +1% increase in time spent from a new notification. But local lift &#8800; global impact: some surfaces get lots of traffic (e.g., homepage), while others have niche audiences.</p><p>When leadership tries to prioritize based on impact, for example, &#8220;Team A improved time spent by 1%, Team B by 0.1%&#8221;, it ignores:</p><ul><li><p>Reach: How many users are exposed.</p></li><li><p>Baseline: Was it already optimized or easy to move?</p></li></ul><p>Typically, in this case, teams use qualitative intuition (&#8220;our surface has less reach,&#8221; &#8220;this was a hard problem,&#8221; etc.), leading to inconsistent prioritization. We need better ways to quantify it.</p>
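<p>A first-order way to put local lifts on a common scale is to weight each one by its surface&#8217;s reach. The sketch below uses made-up surface names and numbers (not Roblox data, and not necessarily the method Wenjing&#8217;s team uses) just to show why a large lift on a niche surface can beat a small lift on a high-reach one - or vice versa.</p>

```python
# Hypothetical surfaces: (local lift on the metric, share of users exposed).
# All numbers are illustrative, not Roblox data.
surfaces = {
    "homepage":      (0.001, 0.90),  # tiny lift, huge reach
    "notifications": (0.010, 0.15),  # big lift, niche reach
}

def global_impact(local_lift: float, reach_share: float) -> float:
    # First-order estimate: a local lift moves the global metric only
    # in proportion to the share of users the surface actually reaches.
    return local_lift * reach_share

# Rank surfaces by reach-weighted (global) impact rather than raw local lift.
ranked = sorted(surfaces.items(),
                key=lambda kv: global_impact(*kv[1]),
                reverse=True)
# With these numbers, notifications' +1% on 15% of users (0.0015 global)
# outweighs the homepage's +0.1% on 90% of users (0.0009 global).
```

<p>This is only the reach half of the picture; the baseline question (was the metric already optimized?) still needs separate judgment or modeling.</p>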
      <p>
          <a href="https://dataanalysis.substack.com/p/causal-inference-methods-for-bridging">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Significance Level vs. Statistical Power in A/B testing - Issue 266]]></title><description><![CDATA[Statistics 101: Why analysts confuse significance with confidence, power with precision - and how to get it right.]]></description><link>https://dataanalysis.substack.com/p/significance-level-vs-statistical</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/significance-level-vs-statistical</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 09 Jul 2025 12:01:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Kva0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc72cc879-f091-4fc0-b7fe-f6eff3fd8724_1538x1032.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the Data Analysis Journal, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>A very nice reader pointed out a typo in my <em><a href="https://dataanalysis.substack.com/p/ab-test-checklist-issue-233">A/B test Checklist</a></em>, and it reminded me of a moment during a meetup (many years ago) when I was arguing with a data scientist about whether we should adopt <em>Type I and Type II</em> errors or <em>Alpha and Beta</em> in our statistical documentation.</p><p>My point was that if you studied statistics in China, Europe, or the USSR, most of the literature refers to Alpha and Beta, and referring to Type I or Type II errors is less common (at least it used to be). 
But in the U.S., I&#8217;ve noticed that most documentation, especially around A/B testing (and especially newer sources), mostly uses Type I and Type II errors. That used to confuse me - I could never remember which one was the false positive and which was the false negative.</p><p>So instead, I trained myself to just stick with <em>Statistical Power</em> and <em>Confidence Level</em>.</p><p>But even then, I kept noticing that during interviews and at work, people often refer to <em>Power</em> as <em>Sensitivity</em>, to <em>p-value</em> as if it were <em>Power</em> (which it&#8217;s not), or simply to <em>Significance</em>, which, depending on context, can also mean <em>Confidence Interval</em> (which some sources <a href="https://www.surveymonkey.com/mp/ab-testing-significance-calculator/">call </a><em><a href="https://www.surveymonkey.com/mp/ab-testing-significance-calculator/">Significance Level</a></em> &#129760;&#128514;).</p><p>The reality is: there are only 4 key statistical terms we deal with in A/B testing, and yet we constantly confuse them or flip their definitions depending on context.</p><p>So, I decided to publish a quick refresher focused only on the core concepts: <strong>Significance Level, Confidence Level, Statistical Power,</strong> and <strong>Confidence Interval</strong>, and how to tell them apart, understand what they actually mean, and share their expected values.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a3bM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69ac194-7ec0-4c9a-923a-29bc5acd88f3_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!a3bM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69ac194-7ec0-4c9a-923a-29bc5acd88f3_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!a3bM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69ac194-7ec0-4c9a-923a-29bc5acd88f3_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!a3bM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69ac194-7ec0-4c9a-923a-29bc5acd88f3_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!a3bM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69ac194-7ec0-4c9a-923a-29bc5acd88f3_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a3bM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69ac194-7ec0-4c9a-923a-29bc5acd88f3_200x200.png" width="164" height="164" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d69ac194-7ec0-4c9a-923a-29bc5acd88f3_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:164,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!a3bM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69ac194-7ec0-4c9a-923a-29bc5acd88f3_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!a3bM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69ac194-7ec0-4c9a-923a-29bc5acd88f3_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!a3bM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69ac194-7ec0-4c9a-923a-29bc5acd88f3_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!a3bM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd69ac194-7ec0-4c9a-923a-29bc5acd88f3_200x200.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div>
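<p>These four terms meet in one place: the sample-size calculation you run before a test. Below is a minimal sketch (my own, using the conventional defaults of a 5% significance level and 80% power; function and variable names are illustrative) of the per-group sample size for a two-sided, two-proportion z-test.</p>

```python
from statistics import NormalDist

def sample_size_per_group(p_control: float, mde_abs: float,
                          alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate per-group sample size for a two-sided two-proportion
    z-test. alpha is the significance level (Type I error rate);
    power = 1 - beta (one minus the Type II error rate)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # set by the confidence level
    z_power = NormalDist().inv_cdf(power)          # set by statistical power
    p_bar = p_control + mde_abs / 2                # average of the two proportions
    variance = 2 * p_bar * (1 - p_bar)
    return (z_alpha + z_power) ** 2 * variance / mde_abs ** 2

# Detecting a 10% -> 11% conversion lift needs roughly 14,750 users per group.
n = sample_size_per_group(0.10, 0.01)
```

<p>Note how each term plays its own role: the significance level fixes z_alpha, power fixes z_power, and the resulting n is what makes the eventual confidence interval narrow enough to detect the chosen lift.</p>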
      <p>
          <a href="https://dataanalysis.substack.com/p/significance-level-vs-statistical">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Spring Recap: 100 Winning Tests, No Growth]]></title><description><![CDATA[Reflections from data summits, industry shifts, and what we&#8217;re still getting wrong in analytics.]]></description><link>https://dataanalysis.substack.com/p/spring-recap-100-winning-tests-no-growth</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/spring-recap-100-winning-tests-no-growth</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 18 Jun 2025 12:02:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the Data Analysis Journal, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>I waited to send out the May recap so I could include highlights from the Snowflake and Databricks summits, which just wrapped up last week. 
Today, I&#8217;m sharing the latest news from the data world - what&#8217;s changing, what it means for us, and what we can do to adapt and grow, even when things in the industry feel upside down.</p><p>Apologies for the long and somewhat heavy newsletter.</p><p>Below, I&#8217;ll share updates on the key tools we use every day, new benchmark reports we measure against, new good tutorials, interesting publications and research papers that caught my eye, and other important developments impacting our work in data analytics.</p><p>But first, 2 quick personal updates:</p><ol><li><p>I am heading to Europe this summer - Barcelona in June, Paris in July, and London in early August. I&#8217;ll be meeting with a few founders, analysts, and friends along the way. If you&#8217;re nearby, I&#8217;d love to connect and chat all things analytics.</p></li><li><p>I am currently looking for a full-time, on-site <em>Product Analyst</em> in San Francisco (40 hours/week). If you are interested in working with me or know someone great, please get in touch!</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Qnep!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657754ed-431b-4278-bbdb-1d3d7d09de17_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Qnep!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657754ed-431b-4278-bbdb-1d3d7d09de17_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!Qnep!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657754ed-431b-4278-bbdb-1d3d7d09de17_200x200.png 848w, 
https://substackcdn.com/image/fetch/$s_!Qnep!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657754ed-431b-4278-bbdb-1d3d7d09de17_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!Qnep!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657754ed-431b-4278-bbdb-1d3d7d09de17_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Qnep!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657754ed-431b-4278-bbdb-1d3d7d09de17_200x200.png" width="156" height="156" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/657754ed-431b-4278-bbdb-1d3d7d09de17_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:156,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Qnep!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657754ed-431b-4278-bbdb-1d3d7d09de17_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!Qnep!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657754ed-431b-4278-bbdb-1d3d7d09de17_200x200.png 848w, 
https://substackcdn.com/image/fetch/$s_!Qnep!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657754ed-431b-4278-bbdb-1d3d7d09de17_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!Qnep!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F657754ed-431b-4278-bbdb-1d3d7d09de17_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>&#128266; Advocating for analytics</h1><p>Last month, I attended <a href="https://mauvegas.com/">MAU</a> in Vegas, the largest mobile app AdTech and MarTech conference, with over 2,000 apps. It was awesome to reconnect with colleagues, friends, and discover so many new apps. But I walked away with some mixed feelings about the speakers, talks, and content.</p><p>At an AdTech or MarTech event, of course you'd expect a focus on marketing, advertising, and store optimization. But what&#8217;s becoming concerning is how deeply marketing is starting to shape product&#8212;specifically how activation, retention, and product growth are being treated as extensions of marketing, rather than the other way around. Here&#8217;s what stood out:</p><h3><strong>1. Marketers still miss the point on Activation.</strong></h3><p>Many see activation as just the final step in onboarding, or &#8220;a high-value action,&#8221; without digging deeper. The few sessions that touched on user activation missed the mark. Activation is not measured against Trials or TTPs. It&#8217;s measured against long-term retention. And most importantly, it has to be <em><strong>predictive</strong></em> of retention. That means modeling behavior across time windows and activity patterns. Learn more - <a href="https://dataanalysis.substack.com/p/why-your-activation-analysis-is-wrong">Why Your Activation Analysis Is Wrong - And How to Fix It</a>.</p><h3><strong>2. 
Measurement is often overcomplicated or misused.</strong></h3><p>Another thing that frustrated me was that in every other talk, I saw a slide where an A/B test was measured against ARPU. Why? Are there no other (proper) measurements available?</p><p>I understand when founders or engineers run quick tests and check ARPU in RevenueCat or A/B Tasty to make a fast decision. But when someone with a &#8220;Growth Consultant&#8221; title uses ARPU as the main metric, it&#8217;s worrying.</p><p>Here&#8217;s the thing: unless you&#8217;re changing the price, a higher ARPU might mean fewer customers. ARPU = Revenue / Customers, right? So, <strong>if your customer count goes down, ARPU can go up</strong>. If your A/B test shows a variant with a +25% increase in ARPU, very likely <em>that&#8217;s not a win</em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WEsn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60585916-58f3-4561-a867-89f96e5b3c3b_1600x560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WEsn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60585916-58f3-4561-a867-89f96e5b3c3b_1600x560.png 424w, https://substackcdn.com/image/fetch/$s_!WEsn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60585916-58f3-4561-a867-89f96e5b3c3b_1600x560.png 848w, https://substackcdn.com/image/fetch/$s_!WEsn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60585916-58f3-4561-a867-89f96e5b3c3b_1600x560.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WEsn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60585916-58f3-4561-a867-89f96e5b3c3b_1600x560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WEsn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60585916-58f3-4561-a867-89f96e5b3c3b_1600x560.png" width="1456" height="510" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60585916-58f3-4561-a867-89f96e5b3c3b_1600x560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:510,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WEsn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60585916-58f3-4561-a867-89f96e5b3c3b_1600x560.png 424w, https://substackcdn.com/image/fetch/$s_!WEsn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60585916-58f3-4561-a867-89f96e5b3c3b_1600x560.png 848w, https://substackcdn.com/image/fetch/$s_!WEsn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60585916-58f3-4561-a867-89f96e5b3c3b_1600x560.png 1272w, 
https://substackcdn.com/image/fetch/$s_!WEsn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60585916-58f3-4561-a867-89f96e5b3c3b_1600x560.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://dataanalysis.substack.com/p/when-simple-becomes-averages">When Simple Becomes Tricky: Making Sense of Averages in Reporting</a></figcaption></figure></div><p>ARPU is a highly misleading measurement for paywall tests (again, unless you change subscription price, which is a different story).</p><p>If you are using LTV, it&#8217;s 
even riskier. ARPU has 2 variables (revenue and customers), but LTV includes more (let&#8217;s break it down: (1) revenue, (2) user lifetime, which is the most recent renewal minus the signup date, and (3) CAC, subtracted in some cases), making it very hard to know what&#8217;s really changed. Without clear variable breakdowns, the results are likely to lead you in the wrong direction. <em><strong><a href="https://www.linkedin.com/posts/olgaberezovsky_teams-often-mistakenly-evaluate-ab-tests-activity-7186334083071782912-Zqtm?utm_source=share&amp;utm_medium=member_desktop&amp;rcm=ACoAAAVar7IBF3om5ghiuk3gTuZo6H8eArR8Yiw">Please, don&#8217;t use LTV or ARPU to measure an A/B test</a></strong></em>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D6hn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D6hn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg 424w, https://substackcdn.com/image/fetch/$s_!D6hn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg 848w, https://substackcdn.com/image/fetch/$s_!D6hn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!D6hn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D6hn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg" width="950" height="468" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:468,&quot;width&quot;:950,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58617,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://dataanalysis.substack.com/i/166219453?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D6hn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg 424w, https://substackcdn.com/image/fetch/$s_!D6hn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!D6hn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!D6hn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe33ddbf9-d119-4dbc-af2e-80e289088832_950x468.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">How To Find Optimal Proxy Metrics</figcaption></figure></div><p>My MAU frustration didn&#8217;t stop there. 
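</p><p>Before moving on: the ARPU trap above is easy to demonstrate with a quick sketch. The numbers below are purely hypothetical (mine, not from any talk or real test), showing a paywall variant that &#8220;wins&#8221; on ARPU while losing revenue:</p>

```python
# Hypothetical paywall A/B test totals (illustrative numbers only).
control_revenue, control_customers = 10_000, 1_000
variant_revenue, variant_customers = 9_000, 720

# ARPU = Revenue / Customers
control_arpu = control_revenue / control_customers      # 10.0
variant_arpu = variant_revenue / variant_customers      # 12.5

arpu_lift = variant_arpu / control_arpu - 1             # the +25% ARPU "win"
revenue_change = variant_revenue / control_revenue - 1  # revenue is actually down

print(f"ARPU lift: {arpu_lift:+.0%}")            # ARPU lift: +25%
print(f"Revenue change: {revenue_change:+.0%}")  # Revenue change: -10%
```

<p>Fewer paying customers at roughly the same per-customer revenue pushes ARPU up while total revenue falls, which is exactly why an ARPU &#8220;win&#8221; on a paywall test deserves a second look.</p><p>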
Sessions on retention, pricing, and monetization often relied on flawed math or shaky logic. And then we wonder why, after 100 &#8220;winning&#8221; tests, the MRR growth rate doesn&#8217;t improve month over month. Please, invest in analytics and due diligence. It really does matter.</p><h1>&#128293; Last month&#8217;s highlights</h1><h3><strong>Snowflake Summit</strong></h3><p><a href="https://www.snowflake.com/en/summit/save-the-date/">Snowflake Summit</a> was fun. Who isn&#8217;t excited about 200 vendors offering slightly different versions of the same ETL tool?</p><p>This year, Snowflake focused heavily on data ingestion, catalogs, and agents. They also announced support for Postgres. But the highlight for me was <a href="https://www.snowflake.com/en/blog/intelligence-snowflake-summit-2025/">Snowflake Intelligence</a> - a new platform that lets you query any of your data using natural language and receive human-like responses and charts without writing SQL. I don&#8217;t see how this is different from <a href="https://www.databricks.com/product/business-intelligence/ai-bi-genie">Databricks&#8217; Genie</a>, announced a year ago... In any case, it requires (a) all data to be available in the data warehouse, (b) a very solid metadata layer to be built, and (c) the right questions to be asked, which brings us right back to <a href="https://dataanalysis.substack.com/p/the-hidden-costs-and-pitfalls-of-cdp">where we started</a>.</p><h3><strong>Databricks Summit</strong></h3><p><a href="https://www.databricks.com/dataaisummit">Databricks Summit</a> felt pretty similar - same crowd, same vendors, with more Spark references. They announced full support for Apache Iceberg and Delta Lake, MLflow 3.0 for end-to-end GenAI pipelines, direct integration with SAP systems, a partnership with Informatica, and more. 
For me, the most surprising update was that Sigma, a tool <a href="https://www.snowflake.com/en/blog/snowflake-ventures-expands-investment-in-sigma-deepening-commitment-to-bringing-world-class-bi-directly-into-the-ai-data-cloud/">built to support BI for Snowflake</a>, is now partnering with Databricks.</p><p>Personally, I find Sigma to be a questionable choice for BI.</p><p>I&#8217;m currently working on a deeper dive into Sigma - talking to its customers and reviewing use cases, and so far, I don&#8217;t understand why teams would use it. So, the Databricks + Sigma partnership was unexpected and raises more questions.</p><h3><strong>Amplitude AI Agent Launch</strong></h3><p>Last week, I had the pleasure of joining the <a href="https://amplitude.com/blog/ai-agents">Amplitude AI Agent Launch</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!04ag!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c0f2c28-8e1e-4d92-a435-b24d17c92da1_1600x871.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!04ag!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c0f2c28-8e1e-4d92-a435-b24d17c92da1_1600x871.png 424w, https://substackcdn.com/image/fetch/$s_!04ag!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c0f2c28-8e1e-4d92-a435-b24d17c92da1_1600x871.png 848w, https://substackcdn.com/image/fetch/$s_!04ag!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c0f2c28-8e1e-4d92-a435-b24d17c92da1_1600x871.png 1272w, 
https://substackcdn.com/image/fetch/$s_!04ag!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c0f2c28-8e1e-4d92-a435-b24d17c92da1_1600x871.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!04ag!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c0f2c28-8e1e-4d92-a435-b24d17c92da1_1600x871.png" width="1456" height="793" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1c0f2c28-8e1e-4d92-a435-b24d17c92da1_1600x871.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:793,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!04ag!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c0f2c28-8e1e-4d92-a435-b24d17c92da1_1600x871.png 424w, https://substackcdn.com/image/fetch/$s_!04ag!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c0f2c28-8e1e-4d92-a435-b24d17c92da1_1600x871.png 848w, https://substackcdn.com/image/fetch/$s_!04ag!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c0f2c28-8e1e-4d92-a435-b24d17c92da1_1600x871.png 1272w, 
https://substackcdn.com/image/fetch/$s_!04ag!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c0f2c28-8e1e-4d92-a435-b24d17c92da1_1600x871.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>&#8220;Instead of acting on a vague alert like &#8220;conversion is down,&#8221; you now have a system that thinks like a proactive teammate&#8212;one that not only spots issues but digs into the why, proposes next steps, and runs experiments. 
While you focus on strategic priorities, your Agents are working in the background, testing multiple hypotheses, analyzing impact, and surfacing the highest-leverage insights.&#8221;</p></blockquote><p>I haven&#8217;t had a chance to test the agent myself yet, so I can&#8217;t comment on how much value it delivers to teams. I&#8217;d say, though, that <em>you can&#8217;t fix what you can&#8217;t see</em>, so the success of any AI agent will depend heavily on how well your tracking and analytics are set up.</p><p>Also, the demo focused on a fairly simple use case: identifying web dead clicks. Is this really a thing? Real-life use cases for hypothesis testing and analysis are way more complicated, but the overall promise and product direction looked impressive.</p><p>If you&#8217;re using Amplitude, you can sign up<a href="https://amplitude.com/ai-agents-beta"> for the beta</a> and try it out (and I&#8217;d love to hear your thoughts if you do!)</p><h3><strong>dbt is upgrading. Meet the <a href="https://docs.getdbt.com/blog/dbt-fusion-engine">dbt Fusion engine</a>.</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h6H5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f4aba9-ff8e-4232-8f4a-7b380fad1f2f_1312x694.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h6H5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f4aba9-ff8e-4232-8f4a-7b380fad1f2f_1312x694.png 424w, https://substackcdn.com/image/fetch/$s_!h6H5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f4aba9-ff8e-4232-8f4a-7b380fad1f2f_1312x694.png 848w, 
https://substackcdn.com/image/fetch/$s_!h6H5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f4aba9-ff8e-4232-8f4a-7b380fad1f2f_1312x694.png 1272w, https://substackcdn.com/image/fetch/$s_!h6H5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f4aba9-ff8e-4232-8f4a-7b380fad1f2f_1312x694.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h6H5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f4aba9-ff8e-4232-8f4a-7b380fad1f2f_1312x694.png" width="1312" height="694" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60f4aba9-ff8e-4232-8f4a-7b380fad1f2f_1312x694.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1312,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h6H5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f4aba9-ff8e-4232-8f4a-7b380fad1f2f_1312x694.png 424w, https://substackcdn.com/image/fetch/$s_!h6H5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f4aba9-ff8e-4232-8f4a-7b380fad1f2f_1312x694.png 848w, 
https://substackcdn.com/image/fetch/$s_!h6H5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f4aba9-ff8e-4232-8f4a-7b380fad1f2f_1312x694.png 1272w, https://substackcdn.com/image/fetch/$s_!h6H5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60f4aba9-ff8e-4232-8f4a-7b380fad1f2f_1312x694.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://docs.getdbt.com/blog/dbt-fusion-engine">Meet the dbt Fusion 
Engine</a></figcaption></figure></div><blockquote><p>&#8220;It&#8217;s the same dbt you know and love, but better and faster&#8221;.</p></blockquote><p>dbt recently announced the Fusion Engine, which is currently in beta. It promises a lot of new features for teams, including a VS Code extension, real-time error detection, faster parsing, and improved project execution.</p><p>I haven&#8217;t figured out yet whether Fusion needs to be installed alongside Core or if it fully replaces it. From what I can tell, you&#8217;ll likely have to make a choice: stick with dbt Core, or get your team ready to &#8220;invest&#8221; and upgrade to Fusion.</p><h1>&#128200; New industry reports and benchmarks</h1><ul><li><p><a href="https://chartmogul.com/reports/saas-go-to-market-report/">The SaaS Go-To-Market Report</a> from ChartMogul</p></li><li><p><a href="https://www.bondcap.com/reports/tai">Mary Meeker's 2025 AI Trends Report: An In-Depth Product Analysis</a></p></li><li><p><a href="https://speero.com/experimentation-maturity-program-reports-2025">Experimentation Maturity Program Reports 2025</a> from CXL</p></li></ul><h1>&#9881;&#65039; Know your craft</h1><ol><li><p>Not sure which chart to use? 
<a href="https://www.metabase.com/learn/cheat-sheets/which-chart-to-use">This cheat sheet from Metabase will help you choose the right chart for your data.</a></p></li><li><p><a href="https://vexlio.com/blog/five-simple-things-that-will-immediately-improve-your-diagrams/">Five simple things that will immediately improve your diagrams</a> from <a href="https://vexlio.com/">Vexlio</a></p></li><li><p><a href="https://www.metabase.com/blog/what-is-embedded-analytics">What is embedded analytics?</a> from Metabase.</p></li><li><p><a href="https://mixpanel.com/blog/what-is-a-user-journeys-for-product-teams/">The essential guide to user journeys for product teams</a> - a must-read guide for product analysts from Mixpanel.</p></li><li><p><a href="https://www.mostlymetrics.com/p/how-public-companies-define-arr">How Public Companies Define ARR</a> from <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Mostly metrics&quot;,&quot;id&quot;:230760,&quot;type&quot;:&quot;pub&quot;,&quot;url&quot;:&quot;https://open.substack.com/pub/cjgustafson&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0447e0e8-ffb1-405c-9062-876b453548e1_1072x1072.png&quot;,&quot;uuid&quot;:&quot;31d23323-817a-4e6b-84db-df285292072d&quot;}" data-component-name="MentionToDOM"></span>. 
</p></li><li><p><a href="https://vyasenov.github.io/mind-map.html">Visualizing the World of Causal Inference</a> from <a href="https://www.linkedin.com/in/vasil-yasenov/">Vasco Yasenov</a>, a Staff Data Scientist at Adobe.</p></li><li><p><a href="https://github.com/gimseng/99-ML-Learning-Projects">99-ML-Learning-Projects</a> - a list of 99 ML projects for anyone interested in learning ML by coding and building projects.</p></li><li><p><a href="https://borkar.substack.com/p/the-8020-guide-to-r-you-wish-you">The 80/20 Guide to R You Wish You Read Years Ago</a> from <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Amol Borkar&quot;,&quot;id&quot;:165361390,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67b97fbf-1f7c-4ac7-8698-2e9b24cf116c_1165x1167.jpeg&quot;,&quot;uuid&quot;:&quot;27e76749-84a3-4c5f-ad2a-ccfbba9c1c5b&quot;}" data-component-name="MentionToDOM"></span> </p></li></ol><h1>&#129299; Analysis and case studies</h1><ul><li><p>Last month, I published <em><a href="https://dataanalysis.substack.com/p/from-analytics-to-data-science-forecasting">Introduction to Forecasting</a></em>, followed by Part 2 - <em><a href="https://dataanalysis.substack.com/p/forecasting-in-analytics">Forecasting in Analytics: Choosing the Right Approach</a></em>, and then an <a href="https://www.linkedin.com/feed/update/urn:li:activity:7329852764195344386/">interesting discussion on LinkedIn</a> where people asked me to break down my manual projection approach. So, now I&#8217;m working on <em>Forecasting Part 3</em>, coming (hopefully) soon. It&#8217;s not easy to consolidate, because forecasting revenue is very different from forecasting subscriptions or MAU, and completely different from projecting churn or growth rates. 
I&#8217;ll likely need to break it down by use case and product type to make it usable.</p></li><li><p><a href="https://a16z.com/ai-market-research/">Faster, Smarter, Cheaper: AI Is Reinventing Market Research</a> - from a16z</p></li><li><p><a href="https://benn.substack.com/p/no-really-everything-becomes-bi">No, really, everything becomes BI</a> from <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Benn Stancil&quot;,&quot;id&quot;:5667744,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/a317e60a-9bd1-4c75-bb54-66d517f735dc_1100x1100.jpeg&quot;,&quot;uuid&quot;:&quot;209182bf-fc30-4a54-a2b9-9501496aa448&quot;}" data-component-name="MentionToDOM"></span> </p></li><li><p><a href="https://filwd.substack.com/p/the-illusion-of-causality-in-charts">The Illusion of Causality in Charts</a> from <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Enrico Bertini&quot;,&quot;id&quot;:5453110,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9d4b9e31-2658-421a-b1d8-76218bc8c674_2316x2316.jpeg&quot;,&quot;uuid&quot;:&quot;71d6e9a9-fe2d-4dca-b211-fc247b0af939&quot;}" data-component-name="MentionToDOM"></span></p></li></ul><h1>&#127891; Tutorials</h1><p>Not a tutorial, but I wanted to share that <a href="https://rowzero.io/">RowZero</a>, a fast-growing spreadsheet for big data, has launched <a href="https://rowzero.io/edu">a free plan for students and recent graduates</a> who want to learn analytics and get comfortable with big spreadsheets, functions, pivot tables, charts, etc. 
The platform lets you explore large public datasets, collaborate with others, and publish your research.</p><p>Thanks for reading, everyone!</p><h3><strong>Previous Recaps:</strong></h3><ul><li><p><a href="https://dataanalysis.substack.com/p/april-recap-everyones-ready-for-ai">April Recap: Everyone&#8217;s Ready for AI. Except Your Data.</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/february-recap-the-rising-demand">February 2025 Digest: The Rising Demand for Analysts</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/january-digest-stop-ab-testing">January 2025 Digest: Stop A/B Testing&#8230; No, Do More!</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/november-digest-build-vs-buy">November 2024 Digest: Build vs. Buy Your A/B Testing Platform?</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/october-digest-is-product-analytics-behind">October 2024 Digest: Is Product Analytics Falling Behind?</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/september-digest-from-dashboards-to-spreadsheets">September 2024 Digest: From Dashboards to AI Spreadsheets</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/spring-digest-building-data-culture">Spring 2024 Digest: Building Data Culture With Metric Trees</a></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Data Analysis Journal is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Refresher on A/B Testing - Issue 258]]></title><description><![CDATA[Stats, tools, and trusted resources for doing experimentation right]]></description><link>https://dataanalysis.substack.com/p/refresher-on-ab-testing</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/refresher-on-ab-testing</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 14 May 2025 12:02:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!L9e4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b7fabeb-3014-4241-918b-3a905300e999_1456x894.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the Data Analysis Journal, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>Yesterday, I ran a workshop on experimentation for a wonderful team of analysts who are transforming the creator economy. It hit me - I&#8217;ve never actually shared a proper introduction to A/B testing. 
Apparently, I just assume people are born knowing what statistical power or a p-value is.</p><p>So, today I decided to consolidate free tutorials and my bookmarked materials to help you get started with A/B testing and experimentation, including tools, key concepts to understand, free classes, and more.</p><div><hr></div><p>Before we jump in, spreading the word about a couple of upcoming events next week:</p><p>&#128467;&#65039; <em><a href="https://lu.ma/l80iwd4x?tk=CXSyHm">Data Science Salon returns to San Francisco</a>:</em></p><p>Join <a href="https://www.datascience.salon/">DSS</a> at the Amazon office next Wednesday for a meetup featuring speakers from Adobe and NextData. Teams will be talking all things data prep and management for AI and automation. Come hang out!</p><p>&#128467;&#65039; <em>I&#8217;ll be in Vegas next week,</em> joining 2,000 marketers at MAU Vegas 2025 - <a href="https://mauvegas.com/">mobile app AdTech and MarTech conference</a>. If you work in mobile app analytics, come say hi! 
Use this link to get a discount: <a href="https://i.snoball.it/p/jzK5X/l">MAU Vegas 2025</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!btac!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce54040-2460-4873-bb2c-11afa67892f4_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!btac!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce54040-2460-4873-bb2c-11afa67892f4_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!btac!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce54040-2460-4873-bb2c-11afa67892f4_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!btac!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce54040-2460-4873-bb2c-11afa67892f4_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!btac!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce54040-2460-4873-bb2c-11afa67892f4_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!btac!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce54040-2460-4873-bb2c-11afa67892f4_200x200.png" width="176" height="176" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cce54040-2460-4873-bb2c-11afa67892f4_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:176,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!btac!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce54040-2460-4873-bb2c-11afa67892f4_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!btac!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce54040-2460-4873-bb2c-11afa67892f4_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!btac!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce54040-2460-4873-bb2c-11afa67892f4_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!btac!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcce54040-2460-4873-bb2c-11afa67892f4_200x200.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>I have a dedicated <a href="https://dataanalysis.substack.com/s/ab-testing">experimentation section</a> in my newsletter, but I realized most of my publications are geared toward experienced analysts, diving into tricky cases like accelerating experimentation, testing on low traffic, or running tests in marketplaces. 
What I don&#8217;t have is a simple introduction to A/B testing.</p><p>That&#8217;s partly because you can&#8217;t really start with experimentation without first understanding statistics. And I don&#8217;t think it&#8217;s helpful to talk about types of A/B tests with someone who doesn&#8217;t yet understand p-values or probability. (Explaining p-values isn&#8217;t easy either.)</p><p>So, I don&#8217;t believe there&#8217;s such a thing as a gentle introduction to A/B testing. And it&#8217;s definitely not something you just "pick up" on the job. Even basic familiarity with testing requires a solid foundation in probability. (<em>Maybe that&#8217;s why it&#8217;s so hard to find analysts experienced with experimentation?</em>)</p><p>There are lots of paid courses, online academies, and YouTube videos out there on experimentation. And Maven courses. Below, I&#8217;m sharing my vetted, trusted, and personally bookmarked free sources on A/B testing. If you work through them step by step, I guarantee you&#8217;ll be well-equipped to work with A/B tests.</p>
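To make the p-value prerequisite concrete, here is a small self-contained simulation (my own sketch, not from any of the resources below): in an A/A test, where both groups get the identical experience, a two-proportion z-test at &#945; = 0.05 should flag a "significant" difference in roughly 5% of runs. That false-positive rate is exactly what the significance threshold controls.

```python
import random
from statistics import NormalDist

def two_prop_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test (normal approximation)."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

rng = random.Random(7)
n, p_true, runs = 2000, 0.05, 1000  # identical 5% conversion rate in both arms
false_positives = sum(
    two_prop_pvalue(
        sum(rng.random() < p_true for _ in range(n)), n,
        sum(rng.random() < p_true for _ in range(n)), n,
    ) < 0.05
    for _ in range(runs)
)
rate = false_positives / runs  # hovers around 0.05, as the threshold promises
```

If this behavior surprises you, that is a good sign you'll benefit from starting with the statistics materials first.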
      <p>
          <a href="https://dataanalysis.substack.com/p/refresher-on-ab-testing">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[A/B Test Checklist - Issue 233]]></title><description><![CDATA[A short guide to product experimentation steps and process.]]></description><link>https://dataanalysis.substack.com/p/ab-test-checklist-issue-233</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/ab-test-checklist-issue-233</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 20 Nov 2024 13:02:54 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/92f2818a-8828-4017-a509-3e4b14a059cc_1550x1102.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi everyone,&nbsp;I just realized it&#8217;s been 2 years since I shared my <a href="https://dataanalysis.substack.com/p/ab-test-checklist">A/B Test Checklist</a>. It&#8217;s one of the shortest articles in my newsletter but also one of the most popular. I designed it as a quick, handy guide and a refresher on terminology, steps, and key calculations.</p><p>This checklist is meant for product and marketing leaders who may not necessarily have a background in statistics but have to work hands-on on experimentation.&nbsp;</p><p>I&#8217;ve updated the guide to make it even clearer and more actionable. 
I hope you find it helpful!</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KP42!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6145d35-55dd-4bc3-bef5-ac72710e294b_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KP42!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6145d35-55dd-4bc3-bef5-ac72710e294b_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!KP42!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6145d35-55dd-4bc3-bef5-ac72710e294b_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!KP42!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6145d35-55dd-4bc3-bef5-ac72710e294b_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!KP42!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6145d35-55dd-4bc3-bef5-ac72710e294b_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KP42!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6145d35-55dd-4bc3-bef5-ac72710e294b_200x200.png" width="170" height="170" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c6145d35-55dd-4bc3-bef5-ac72710e294b_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:170,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KP42!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6145d35-55dd-4bc3-bef5-ac72710e294b_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!KP42!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6145d35-55dd-4bc3-bef5-ac72710e294b_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!KP42!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6145d35-55dd-4bc3-bef5-ac72710e294b_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!KP42!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6145d35-55dd-4bc3-bef5-ac72710e294b_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>&#9997;&#65039; Steps for conducting a product experiment:&nbsp;</h1><h3>1. Can you test it?&nbsp;</h3><p>Not everything can be A/B tested. New experiences or major product releases may not fit the typical A/B testing framework (read <em><a href="https://dataanalysis.substack.com/p/how-to-measure-new-feature-adoption">How To Measure New Feature Adoption</a></em>). 
Consider potential bias, such as the <a href="https://productds.com/wp-content/uploads/Novelty_Effect.html">novelty effect</a> or change aversion.</p><h3>2. Should you test it?</h3><p><strong>Do you have enough users or events? </strong>As a rule of thumb, without a sample size of at least 50,000 users, A/B test results may not be statistically reliable.</p><p><strong>Set your expected rate</strong> &#8211; this is your <a href="https://splitmetrics.com/resources/minimum-detectable-effect-mde/">Minimum Detectable Effect</a> (MDE). The MDE is the smallest difference between the Control and Variant that is worth detecting. If the Variant is only 0.02% better than the Control, would you still want to run the test? Is it worth the cost and time?</p><p><strong>Is it a good time to run the test?</strong> If the tested dashboard will eventually be sunsetted or the user flow replaced, there is no point in testing a feature that is about to be deprecated.</p><p>Also, consider factors such as seasonality, upcoming version releases, open bugs, etc.</p><h3>3. Formulate a hypothesis</h3><p><em>&#8220;The new navigation flow will improve user retention&#8221;</em> is too vague to serve as a hypothesis. Instead, consider more measurable alternatives:</p><ul><li><p>The new navigation flow will increase the frequency of user app opens per day by at least 10%.</p></li><li><p>The new navigation flow will improve View-to-Paid CVR by at least 5%.</p></li><li><p>The new navigation flow will drive at least 10% more DAU to the dashboard.</p></li></ul><p>A well-defined hypothesis should focus on one or two success metrics.</p><h3>4. Finalize your set of metrics.</h3><p>For A/B analysis, I use a set of 3 metrics:</p>
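The "should you test it?" checks in step 2 can be sketched numerically. Below is a minimal sample-size calculator for a two-proportion test (my own illustration, using the standard normal-approximation formula; the 5% baseline and 10% relative MDE are made-up example numbers):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Required users per variant for a two-sided two-proportion test
    (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # rate we hope the Variant reaches
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# e.g. 5% baseline conversion, aiming to detect a 10% relative lift
n = sample_size_per_variant(0.05, 0.10)  # roughly 31k users per variant
```

Note how quickly the requirement grows as the MDE shrinks: halving the detectable lift roughly quadruples the sample you need, which is why a 0.02% difference is rarely worth chasing.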
      <p>
          <a href="https://dataanalysis.substack.com/p/ab-test-checklist-issue-233">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How Long Until We Reach Significance? - Issue 231]]></title><description><![CDATA[Methods, steps, and tools for estimating A/B test duration]]></description><link>https://dataanalysis.substack.com/p/how-long-until-we-reach-significance</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/how-long-until-we-reach-significance</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 06 Nov 2024 13:00:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vv8J!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff40021d4-27e9-4bcc-b3eb-2913160ab9fc_1540x1032.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>With all eyes on today&#8217;s election results (thankfully, I don&#8217;t know them yet since this post was scheduled a few days ago), I know many of you might feel tired and drained. But here we are on Wednesday, and I&#8217;m using any opportunity to advocate for analytics done right. 
So, welcome to my <a href="https://dataanalysis.substack.com/">Data Analysis Journal</a> - a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><p>One of the most popular interview questions, and the first step when preparing for a test launch, is estimating how long the A/B test will take to run.</p><p>Figuring out the duration of a test really comes down to estimating how long it&#8217;ll take to reach significance.</p><p>And to estimate that timeline, you need to calculate the sample size needed for the A/B test.</p><p>Today, let&#8217;s dive into some easy and challenging ways, along with calculators and methods, to help you predict how many days, weeks, or even months it might take for a test to reach significance.</p>
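As a back-of-envelope sketch of that idea (my own illustration, ahead of the calculators and methods in the post): once you know the required sample size, the duration estimate is simply the total sample divided by the daily traffic you can actually route into the test.

```python
import math

def days_to_significance(n_per_variant, daily_eligible_users,
                         n_variants=2, traffic_allocation=1.0):
    """Rough test-duration estimate: total required sample / daily intake.
    Assumes steady traffic and an even split across variants."""
    total_needed = n_per_variant * n_variants
    daily_intake = daily_eligible_users * traffic_allocation
    return math.ceil(total_needed / daily_intake)

# e.g. ~31,000 users per variant, 5,000 eligible users/day, 50% in the test
days = days_to_significance(31_000, 5_000, traffic_allocation=0.5)  # 25 days
```

The hypothetical numbers above are just placeholders; real traffic is rarely steady, which is where the more careful methods below come in.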
      <p>
          <a href="https://dataanalysis.substack.com/p/how-long-until-we-reach-significance">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[Frequentist vs. Bayesian: Which Method Should You Choose for Your A/B Testing? - Issue 224]]></title><description><![CDATA[Learn the pros, cons, and use cases for Frequentist and Bayesian approaches to ensure reliable and trusted A/B test results.]]></description><link>https://dataanalysis.substack.com/p/frequentist-vs-bayesian-which-method</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/frequentist-vs-bayesian-which-method</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Sun, 29 Sep 2024 12:03:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wj-T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97b509e9-8df0-48ef-a6f1-71b0cfb731d0_1600x661.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I don&#8217;t usually clutter your inboxes on a Sunday, but I have one more quick topic to share with you this month.</p><p>Last week, a colleague asked me which type of A/B test he should run &#8211; Bayesian or Frequentist. I searched for an article that explains the use cases but couldn&#8217;t find anything helpful that doesn&#8217;t dive into ten pages of hypothesis testing theory. So, I decided to write my own today.</p><p>Let&#8217;s talk about the types of A/B testing and which method to choose for different use cases. 
I know it&#8217;s Sunday, so I&#8217;ll keep it short and sweet.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tSM1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25905f2b-eafe-49bf-979f-e95a2ccd2383_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tSM1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25905f2b-eafe-49bf-979f-e95a2ccd2383_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!tSM1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25905f2b-eafe-49bf-979f-e95a2ccd2383_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!tSM1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25905f2b-eafe-49bf-979f-e95a2ccd2383_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!tSM1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25905f2b-eafe-49bf-979f-e95a2ccd2383_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tSM1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25905f2b-eafe-49bf-979f-e95a2ccd2383_200x200.png" width="200" height="200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/25905f2b-eafe-49bf-979f-e95a2ccd2383_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tSM1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25905f2b-eafe-49bf-979f-e95a2ccd2383_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!tSM1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25905f2b-eafe-49bf-979f-e95a2ccd2383_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!tSM1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25905f2b-eafe-49bf-979f-e95a2ccd2383_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!tSM1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F25905f2b-eafe-49bf-979f-e95a2ccd2383_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>A quick recap on hypothesis test statistics:</h1><p>An A/B test is a form of hypothesis testing. There are many types of A/B tests, and at work, we typically apply A/A, A/B, Split, or Multivariate testing.</p><p>Frequentist and Bayesian approaches are different methods of interpreting A/B test results. 
These are NOT types of A/B tests you can choose but rather <strong>approaches to interpreting the test results</strong>.</p><p>While most articles refer to Frequentist and Bayesian methods as &#8220;ways to read test data,&#8221; I prefer to think of them as different schools or, better yet, <em>movements</em> in statistics for handling experimentation.</p><h2>Frequentist statistics&nbsp;</h2><p>If you took a statistics class, you were likely taught the Frequentist method for interpreting A/B test data. It is primarily what academia teaches. The Frequentist approach makes inferences about the test results using <strong>only data from the current experiment</strong>, without any prior context about user behavior. All parameters (mean, median, variance, etc.) and distributions used for A/B test analysis are considered fixed.</p>
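To make the contrast concrete, here is a sketch (my own, with made-up conversion counts) computing the same A/B result both ways: a Frequentist two-proportion z-test p-value, and a Bayesian probability that the Variant beats the Control under uniform Beta(1,1) priors:

```python
import random
from statistics import NormalDist

def frequentist_pvalue(conv_a, n_a, conv_b, n_b):
    """Frequentist read: two-sided two-proportion z-test,
    using only this experiment's data."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def bayesian_prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=1):
    """Bayesian read: P(Variant rate > Control rate), estimated by
    Monte Carlo sampling from the Beta posteriors."""
    rng = random.Random(seed)
    return sum(
        rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        > rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        for _ in range(draws)
    ) / draws

# hypothetical counts: 500/10,000 Control vs 560/10,000 Variant conversions
p_value = frequentist_pvalue(500, 10_000, 560, 10_000)
prob_b_wins = bayesian_prob_b_beats_a(500, 10_000, 560, 10_000)
```

With these made-up numbers the Frequentist test comes out borderline (p roughly 0.06, so not significant at &#945; = 0.05), while the Bayesian read says the Variant is better with roughly 97% probability: two framings of the very same data, which is exactly the choice discussed below.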
      <p>
          <a href="https://dataanalysis.substack.com/p/frequentist-vs-bayesian-which-method">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[The Database of Winning A/B Tests - Issue 218]]></title><description><![CDATA[A/B tests success rates, curated collection of proven A/B tests, and the importance of transparent experimentation.]]></description><link>https://dataanalysis.substack.com/p/the-database-of-winning-ab-tests</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/the-database-of-winning-ab-tests</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 21 Aug 2024 12:01:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Ev72!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99afb30c-bcf5-42bd-87cc-4d7f9d99433a_1600x913.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Before we start, I want to share an exciting opportunity in New York: join Peloton's Product Analytics team as a <a href="https://careers.onepeloton.com/en/all-jobs/6154493/data-scientist-product-analytics/?gh_jid=6154493">Data Scientist</a>! &#128202;&#128200;</p><div><hr></div><p>Have you ever wondered why companies spend time on A/B testing when thousands of other companies have already tested the same and know which shape, color, and type of CTA performs better?</p><p>Wouldn&#8217;t it be amazing to have a universal database or repository of all completed A/B tests, with documented learnings on what color, banner size, and paywall format performed best?</p><p>Well, there is one! Not one, many. And they are growing.</p><p>Today, let&#8217;s talk about:&nbsp;</p><ul><li><p>Where you can access these learnings from other companies and where you can share your own (please do).&nbsp;</p></li><li><p>Why we keep testing and re-testing exactly the same things. 
Over and over again.</p></li><li><p>What the test success rate is at other companies.&nbsp;</p></li><li><p>What the common winning stereotypes are (dark-screen mode, rounded buttons, sticky banners, etc.), and how accurate and trusted they are.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b8Cg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8cf08fb-88a2-41bb-83ed-5e76f6f1ee69_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b8Cg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8cf08fb-88a2-41bb-83ed-5e76f6f1ee69_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!b8Cg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8cf08fb-88a2-41bb-83ed-5e76f6f1ee69_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!b8Cg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8cf08fb-88a2-41bb-83ed-5e76f6f1ee69_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!b8Cg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8cf08fb-88a2-41bb-83ed-5e76f6f1ee69_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b8Cg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8cf08fb-88a2-41bb-83ed-5e76f6f1ee69_200x200.png" width="164" height="164" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a8cf08fb-88a2-41bb-83ed-5e76f6f1ee69_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:164,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b8Cg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8cf08fb-88a2-41bb-83ed-5e76f6f1ee69_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!b8Cg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8cf08fb-88a2-41bb-83ed-5e76f6f1ee69_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!b8Cg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8cf08fb-88a2-41bb-83ed-5e76f6f1ee69_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!b8Cg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8cf08fb-88a2-41bb-83ed-5e76f6f1ee69_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>A/B tests repositories</h1>
      <p>
          <a href="https://dataanalysis.substack.com/p/the-database-of-winning-ab-tests">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Why it’s wrong to run A/B test for too long - Issue 213]]></title><description><![CDATA[The risks of extending the A/B test timeline or why duration matters.]]></description><link>https://dataanalysis.substack.com/p/why-its-wrong-to-run-ab-test-for-long</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/why-its-wrong-to-run-ab-test-for-long</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 17 Jul 2024 12:02:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oYbG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7466ed35-a341-400b-ab95-6c95e21faaad_1598x924.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve talked a lot about why you shouldn&#8217;t stop A/B tests early and what the caveats are to disregarding the test significance.&nbsp;</p><p>However, one topic I haven't yet addressed in my newsletter is <strong>why it's also wrong to run an A/B test for too long</strong>.&nbsp;</p><p>In other words:</p><ul><li><p>if the test proved the hypothesis and</p></li><li><p>the winner is truly making a difference and&nbsp;</p></li><li><p>the observed lift didn&#8217;t occur by chance,&nbsp;</p></li></ul><p>Why does it matter if the test runs for 2 weeks or 6 months? If the winner is indeed true, shouldn't the longer timeline confirm it?</p><p>Not quite. Let's dive into this topic today.</p>
      <p>
          <a href="https://dataanalysis.substack.com/p/why-its-wrong-to-run-ab-test-for-long">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Methods To Accelerate A/B Testing - Issue 202]]></title><description><![CDATA[Strategies and statistical methods to increase test velocity.]]></description><link>https://dataanalysis.substack.com/p/methods-to-accelerate-ab-testing</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/methods-to-accelerate-ab-testing</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 15 May 2024 12:01:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29d8dad6-4384-4b3c-99f8-98ce46c57a88_1042x536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last year, I published <a href="https://dataanalysis.substack.com/p/embracing-the-new-era-of-accelerated">Embracing the New Era of Accelerated Testing</a>, which was both somewhat controversial and emotional to write.&nbsp;</p><p>As a statistician trained to adapt academic principles to the fast-paced tech environment where <a href="https://dataanalysis.substack.com/p/inside-product-analytics-decoding">nothing is trusted</a>, I had to acknowledge that the concepts we were taught at school have become outdated and no longer serve us well.&nbsp;&nbsp;</p><p>The new generation of tools has accelerated the speed of product delivery. Every aspect of mobile/web development, including design, QA, and research, now runs twice as fast as it did a few years ago. Tools like Split, Superwall, Adapty, and even native Apple solutions offer incredible capabilities. 
Today, we have the technology to <a href="https://www.split.io/blog/dynamic-configurations-run-more-experiments-without-changing-code/">iterate "on the fly"</a> by continuously shipping and optimizing features.&nbsp;</p><p><strong>However, A/B testing practices and frameworks have remained the same, creating a gap between how fast the team is ready to move, how much trust we put in the data we receive, and how quickly we decipher its signals.</strong></p><p>Teams want to run more tests - faster and more efficiently. </p><p>Let&#8217;s discuss today what you can do to increase test velocity. What statistical methods and solutions are available for you to leverage to speed up testing, and how can you strike a balance between trust and speed?</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-Gud!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f88ca67-322c-4f9b-9914-24da5b405590_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-Gud!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f88ca67-322c-4f9b-9914-24da5b405590_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!-Gud!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f88ca67-322c-4f9b-9914-24da5b405590_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!-Gud!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f88ca67-322c-4f9b-9914-24da5b405590_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-Gud!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f88ca67-322c-4f9b-9914-24da5b405590_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-Gud!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f88ca67-322c-4f9b-9914-24da5b405590_200x200.png" width="170" height="170" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f88ca67-322c-4f9b-9914-24da5b405590_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:170,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-Gud!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f88ca67-322c-4f9b-9914-24da5b405590_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!-Gud!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f88ca67-322c-4f9b-9914-24da5b405590_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!-Gud!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f88ca67-322c-4f9b-9914-24da5b405590_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-Gud!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f88ca67-322c-4f9b-9914-24da5b405590_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h2>Analytics is accelerating. Gear up.</h2><p>Duolingo is running <a href="https://blog.duolingo.com/improving-duolingo-one-experiment-at-a-time/">"a few hundred experiments simultaneously</a>.&#8221; At <a href="https://medium.com/pinterest-engineering/building-pinterests-a-b-testing-platform-ab4934ace9f4">Pinterest</a> and <a href="https://www.uber.com/blog/xp/">Uber</a>, over 1,000 experiments are active at any given time.</p>
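The specific methods are behind the link above, but one widely adopted way to increase test velocity is CUPED-style variance reduction using pre-experiment data (named here as a representative technique, not quoted from the post). A minimal sketch, assuming NumPy:

```python
import numpy as np

def cuped_adjust(metric, covariate):
    """CUPED: remove the variance in `metric` explained by a pre-experiment
    covariate; the mean (and therefore the measured lift) is unchanged."""
    theta = np.cov(covariate, metric)[0, 1] / np.var(covariate, ddof=1)
    return metric - theta * (covariate - covariate.mean())

rng = np.random.default_rng(7)
pre = rng.normal(10, 3, 10_000)              # pre-experiment engagement
post = 0.8 * pre + rng.normal(0, 1, 10_000)  # in-test metric, correlated
adjusted = cuped_adjust(post, pre)
# variance drops sharply, so the same effect reaches significance
# with fewer users - i.e. in less time
```

The stronger the correlation between the pre-experiment covariate and the in-test metric, the larger the variance reduction and the faster the test.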
      <p>
          <a href="https://dataanalysis.substack.com/p/methods-to-accelerate-ab-testing">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How To Find Optimal Proxy Metrics - Issue 197]]></title><description><![CDATA[Using proxy metrics for measuring and quantifying the impact of a product rollout or a new feature.]]></description><link>https://dataanalysis.substack.com/p/how-to-find-optimal-proxy-metrics</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/how-to-find-optimal-proxy-metrics</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 17 Apr 2024 12:01:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!qLRR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71790b6a-ae07-438d-a499-8bb0cb72be0c_950x468.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <a href="https://dataanalysis.substack.com/">Data Analysis Journal</a>, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Teams often mistakenly evaluate A/B tests against North Star metrics or business KPIs, such as user retention, customer churn, revenue, or LTV. 
<strong>This approach is flawed</strong> for 2 reasons:&nbsp;</p><ul><li><p>First, business metrics are not sensitive, meaning any lift created by an A/B test is unlikely to be reflected in KPIs (<a href="https://dataanalysis.substack.com/p/ab-test-checklist">unless your MDE is over 50%</a>).&nbsp;</p></li><li><p>Second, business metrics are designed to resist short-term changes from A/B tests and to reflect only long-term impacts.</p></li></ul><p>Despite this, using business KPIs to measure the effectiveness of A/B tests remains a common practice. To address this issue, researchers from Google, Stanford, and the Department of Statistical Science at Duke University collaborated on a study to explain why you should use <em>sensitive proxy metrics for A/B tests instead of the North Star or business KPIs</em>.&nbsp;</p><p>They introduced the concept of <a href="https://arxiv.org/pdf/2307.01000.pdf">Pareto Optimal Proxy Metrics</a>, which significantly improve the accuracy and sensitivity of lift predictions.
Let&#8217;s dive into why business core KPIs aren&#8217;t suitable for A/B tests and how to select the appropriate proxy metric for measuring product rollouts or feature optimization.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CbEO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e04a5-aec6-4314-aa18-38e8aa93ff27_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CbEO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e04a5-aec6-4314-aa18-38e8aa93ff27_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!CbEO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e04a5-aec6-4314-aa18-38e8aa93ff27_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!CbEO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e04a5-aec6-4314-aa18-38e8aa93ff27_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!CbEO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e04a5-aec6-4314-aa18-38e8aa93ff27_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CbEO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e04a5-aec6-4314-aa18-38e8aa93ff27_200x200.png" width="162" height="162" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/042e04a5-aec6-4314-aa18-38e8aa93ff27_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:162,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CbEO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e04a5-aec6-4314-aa18-38e8aa93ff27_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!CbEO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e04a5-aec6-4314-aa18-38e8aa93ff27_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!CbEO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e04a5-aec6-4314-aa18-38e8aa93ff27_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!CbEO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F042e04a5-aec6-4314-aa18-38e8aa93ff27_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h1>Product is a subset of the business.</h1><p>If you have been reading my newsletter, you're already familiar with how I classify all metrics into three layers in product analytics:</p><h3>1. Success metrics&nbsp;</h3><p>Success metrics are 3-4 data points that directly illustrate the change in a user behavior affected by a product launch or a test. 
It can be the total or the average number of user actions, number of sales or transactions, click-through rate, monthly/annual subscription ratio, or similar.</p><h3>2. Ecosystem metrics</h3><p>Ecosystem metrics are typically top or core KPIs. These are high-level business metrics that often involve calculations, specific date window logic, and filters. This can be MAU, MRR, Churn rate, or the North Star.</p><h3>3. Dark side / Tradeoff metrics</h3><p>Tradeoff, tension, or counter metrics are meant to show a negative effect of a product change that you want to keep your eyes on. For example, it can be % unsubscribes, refunds, reports of spam, or fraud. As I said earlier, if your product change increases usage in one feature, you are likely to see a decline in another. Users shift where they spend time. These metrics help you measure the balance and fully understand the impact.</p><h3>Success metrics are not Ecosystem metrics.</h3><p>Below, I will reference the <a href="https://arxiv.org/abs/2307.01000">Pareto optimal proxy metrics</a> study conducted by data scientists <a href="https://www.linkedin.com/in/lee-richardson-46180756/">Lee Richardson</a>, <a href="https://www.linkedin.com/in/alessandro-zito-92b250153/">Alessandro Zito</a>, <a href="https://www.linkedin.com/in/jacsor/">Jacopo Soriano</a>, and <a href="https://www.linkedin.com/in/dylan-greaves/">Dylan Greaves</a>.  </p><blockquote><p>&#8220;North star metrics are central to the operations of technology companies like Airbnb, Uber, and Google, amongst many others. Functionally, teams use north star metrics to align priorities, evaluate progress, and determine if features should be launched. Although north star metrics are valuable, there are issues using north star metrics in experimentation.
To understand the issues better, it is important to know how experimentation works at large tech companies.&#8221;</p></blockquote><p>Traditional business KPIs are not suitable metrics for assessing the impact of product initiatives. These KPIs primarily serve as ecosystem health indicators meant to describe the state of business and safeguard it against various seasonal, external, and micro effects.</p><p>However, when conducting an A/B test, relying on metrics like retention may not accurately reflect the true impact of an initiative. For example, users in the Variant group might return to your app 1% more than those in the Control group, but 62% of them could also be influenced by other hidden effects you can&#8217;t capture or measure. These could include phone settings, a new viral video on TikTok, or unrelated factors like the plum season or a time change.</p><p>Remember the case study from Strava on <a href="https://dataanalysis.substack.com/p/how-strava-accelerated-user-engagement">how they boosted user engagement through a redesign of the Route Detail page</a>? For the A/B test, while they aimed to improve user retention in the app, they chose to use very sensitive, granular metrics such as average page views per user, saving routes, downloading routes, recording routes, and others, rather than relying on broader measures like week-over-week product usage or retention.</p><p>Retention is an ecosystem KPI. It is not sensitive to isolated fluctuation or noise. It isn&#8217;t designed to be sensitive because it&#8217;s an output metric (read <a href="https://brianbalfour.com/quick-takes/common-mistakes-defining-metrics">Common Mistakes In Defining Metrics</a> by <a href="https://brianbalfour.com/">Brian Balfour</a>). 
However, clicks, page opens, and transactions are good for capturing immediate responses and variations, thus offering a more nuanced view of specific actions or behaviors.</p><h1>Finding optimal proxy metrics</h1><h3>What is a proxy metric?</h3><blockquote><p>&#8220;The ideal proxy metric is short-term sensitive, and an accurate predictor of the long-term impact of the north star metric.&#8221;&nbsp;</p></blockquote><p>Here are 2 scenarios where the proxy metric helps teams overcome the limitations of the North Star metric:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qLRR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71790b6a-ae07-438d-a499-8bb0cb72be0c_950x468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qLRR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71790b6a-ae07-438d-a499-8bb0cb72be0c_950x468.png 424w, https://substackcdn.com/image/fetch/$s_!qLRR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71790b6a-ae07-438d-a499-8bb0cb72be0c_950x468.png 848w, https://substackcdn.com/image/fetch/$s_!qLRR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71790b6a-ae07-438d-a499-8bb0cb72be0c_950x468.png 1272w, https://substackcdn.com/image/fetch/$s_!qLRR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71790b6a-ae07-438d-a499-8bb0cb72be0c_950x468.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!qLRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71790b6a-ae07-438d-a499-8bb0cb72be0c_950x468.png" width="716" height="352.7242105263158" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71790b6a-ae07-438d-a499-8bb0cb72be0c_950x468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:468,&quot;width&quot;:950,&quot;resizeWidth&quot;:716,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qLRR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71790b6a-ae07-438d-a499-8bb0cb72be0c_950x468.png 424w, https://substackcdn.com/image/fetch/$s_!qLRR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71790b6a-ae07-438d-a499-8bb0cb72be0c_950x468.png 848w, https://substackcdn.com/image/fetch/$s_!qLRR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71790b6a-ae07-438d-a499-8bb0cb72be0c_950x468.png 1272w, https://substackcdn.com/image/fetch/$s_!qLRR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71790b6a-ae07-438d-a499-8bb0cb72be0c_950x468.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Using business KPIs or North Star metrics for A/B testing has 2 main concerns:</p><h4>1. Ecosystem core metrics are not sensitive.&nbsp;</h4><p>Sensitivity refers to the metric&#8217;s ability to detect a statistically significant effect, often associated with statistical power. If the metric is not sensitive, it means that the test results do not clearly indicate whether the hypothesis is improving or degrading the KPI.</p><blockquote><p>&#8220;For example, metrics related to Search quality will be more sensitive in Search experiments, and less sensitive in experiments from other product areas (notifications, home feed recommendations, etc.).&#8221;</p></blockquote><h4>2. 
Business metrics are not directional to user behavior or experience</h4><blockquote><p>&#8220;Through directionality, we want to capture the alignment between the increase (decrease) in the metric and long-term improvement (deterioration) of the user experience. While this is ideal, getting ground truth data for directionality can be complex.&#8221;</p></blockquote><p>Researchers suggest several methods to measure directionality by comparing the short-term value of a metric against the long-term value of a KPI.&nbsp;</p><blockquote><p>&#8220;The advantage of this approach is that we can compute the measure in every experiment. The disadvantage is that the estimate of the treatment effect of the north star metric is noisy, which makes it harder to separate the correlation in noise from the correlation in the treatment effects. This can be handled, however, by measuring correlation across repeated experiments.&#8221;</p></blockquote><p>Essentially, you need to run correlations between a proxy metric and a KPI, measuring their linear relationship with a Pearson correlation (or their rank-order relationship with a Spearman correlation).</p><p>However, there&#8217;s a catch: there is an inverse relationship between sensitivity and directionality:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vM3m!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda89d547-9950-4c5c-b2dd-af6d15deed77_794x658.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vM3m!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda89d547-9950-4c5c-b2dd-af6d15deed77_794x658.png 424w,
https://substackcdn.com/image/fetch/$s_!vM3m!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda89d547-9950-4c5c-b2dd-af6d15deed77_794x658.png 848w, https://substackcdn.com/image/fetch/$s_!vM3m!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda89d547-9950-4c5c-b2dd-af6d15deed77_794x658.png 1272w, https://substackcdn.com/image/fetch/$s_!vM3m!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda89d547-9950-4c5c-b2dd-af6d15deed77_794x658.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vM3m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda89d547-9950-4c5c-b2dd-af6d15deed77_794x658.png" width="510" height="422.6448362720403" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da89d547-9950-4c5c-b2dd-af6d15deed77_794x658.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:658,&quot;width&quot;:794,&quot;resizeWidth&quot;:510,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vM3m!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda89d547-9950-4c5c-b2dd-af6d15deed77_794x658.png 424w, 
https://substackcdn.com/image/fetch/$s_!vM3m!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda89d547-9950-4c5c-b2dd-af6d15deed77_794x658.png 848w, https://substackcdn.com/image/fetch/$s_!vM3m!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda89d547-9950-4c5c-b2dd-af6d15deed77_794x658.png 1272w, https://substackcdn.com/image/fetch/$s_!vM3m!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda89d547-9950-4c5c-b2dd-af6d15deed77_794x658.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Finding the right proxy metric involves balancing sensitivity and directionality: <strong>the more you increase sensitivity, the less likely the proxy metric will relate to business KPIs</strong>. This is a very frustrating aspect of product experimentation. Basically, the faster you can detect a significant lift in a success metric, the further it may be from reflecting true ecosystem KPI improvements, and vice versa.</p><p>Researchers have proposed a new method for identifying proxy metrics that optimizes the trade-off between sensitivity and directionality using a <a href="https://en.wikipedia.org/wiki/Pareto_efficiency">Pareto optimal framework</a> (or Pareto efficiency), a concept from economics and statistics used for problem optimization when dealing with multiple functions or factors. They tested 3 algorithms on over 500 experiments conducted over a period of 6 months. This testing aimed to compare the proxy metric versus the North Star metric, using binary sensitivity and the proxy score to assess the quality of the proxy metric.&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gk0c!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274a58c4-61b3-4dec-8b61-32621e6d7958_912x354.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Gk0c!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274a58c4-61b3-4dec-8b61-32621e6d7958_912x354.png 424w, 
https://substackcdn.com/image/fetch/$s_!Gk0c!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274a58c4-61b3-4dec-8b61-32621e6d7958_912x354.png 848w, https://substackcdn.com/image/fetch/$s_!Gk0c!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274a58c4-61b3-4dec-8b61-32621e6d7958_912x354.png 1272w, https://substackcdn.com/image/fetch/$s_!Gk0c!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274a58c4-61b3-4dec-8b61-32621e6d7958_912x354.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gk0c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274a58c4-61b3-4dec-8b61-32621e6d7958_912x354.png" width="698" height="270.9342105263158" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/274a58c4-61b3-4dec-8b61-32621e6d7958_912x354.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:354,&quot;width&quot;:912,&quot;resizeWidth&quot;:698,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Gk0c!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274a58c4-61b3-4dec-8b61-32621e6d7958_912x354.png 424w, 
https://substackcdn.com/image/fetch/$s_!Gk0c!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274a58c4-61b3-4dec-8b61-32621e6d7958_912x354.png 848w, https://substackcdn.com/image/fetch/$s_!Gk0c!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274a58c4-61b3-4dec-8b61-32621e6d7958_912x354.png 1272w, https://substackcdn.com/image/fetch/$s_!Gk0c!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F274a58c4-61b3-4dec-8b61-32621e6d7958_912x354.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>&#8220;The key idea is that the proxy score rewards both sensitivity, and accurate directionality. More sensitive metrics are more likely to be in the first and third rows, where they can accumulate reward. But metrics in the first and third rows can only accumulate reward if they are in the correct direction. Thus, the proxy score rewards both sensitivity and directionality. Microsoft independently developed a similar score, called Label Agreement&#8221;</p></blockquote><p>Read more about their study - <a href="https://arxiv.org/pdf/2307.01000.pdf">Pareto Optimal Proxy Metrics</a>.</p><p>I'm not aware of any tool allowing product owners to quickly evaluate the relationship between proxy metrics and business KPIs, guiding them toward the appropriate success metric. Ideally, you would have a fellow analyst on your side who can help translate each business KPI into suitable proxy metrics. These metrics can then be used to measure the effectiveness of product initiatives. For example:</p><ul><li><p>Monthly retention &#8594; screen views or particular clicks.&nbsp;</p></li><li><p>MRR/ARR&nbsp; &#8594; successful transactions or completed payments.</p></li><li><p>Churn &#8594; cancellations, requests to cancel, or unsubscribes.&nbsp;</p></li><li><p>Subscription renewals &#8594; successful payments or transactions.</p></li></ul><div><hr></div><p>I am fascinated by this study. Not only is it timely, but it's also easy to understand and interpret. It addressed long-standing pain points we deal with in experimentation - why are improvements in the Variant not reflected in ecosystem core metrics? Why do these improvements only become visible in KPIs 4-5 months after the test is completed? And why does every method attempting to quantify the relationship between the success proxy metric and the business KPI seem destined to fail?</p><p>Thanks for reading, everyone.
Until next Wednesday!</p><h3><strong>Related publications:&nbsp;</strong></h3><ul><li><p><a href="https://dataanalysis.substack.com/p/how-to-measure-new-feature-adoption">How To Measure New Feature Adoption</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/inside-product-analytics-decoding">Inside Product Analytics: Decoding User Behavior</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/an-analysis-of-bias-or-why-ab-testing">An Analysis Of Bias Or Why A/B Testing Fails</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/the-ultimate-guide-to-product-features">The Ultimate Guide To Product Features Analysis</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/my-most-challenging-analysis-issue">My Most Challenging Analysis</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/how-strava-accelerated-user-engagement">How Strava Accelerated User Engagement</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/november-digest-its-okay-not-to-have">It&#8217;s Okay Not To Have All The Answers</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/why-you-shouldnt-stop-ab-tests-early-cbd">Why You Shouldn&#8217;t Stop A/B Tests Early</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/where-to-build-that-metric">Where To Build That Metric</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Why You Shouldn’t Stop A/B Tests Early - Issue 193]]></title><description><![CDATA[Why significance matters, and how long you should run an A/B test]]></description><link>https://dataanalysis.substack.com/p/why-you-shouldnt-stop-ab-tests-early-cbd</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/why-you-shouldnt-stop-ab-tests-early-cbd</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 20 Mar 2024 12:01:50 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417135f9-7fc9-4a55-a109-078e4aa904d8_1600x455.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <a href="https://dataanalysis.substack.com/">Data Analysis Journal</a>, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Today, I wanted to cover the most commonly asked questions on experimentation in analytics:&nbsp;</p><ul><li><p>How long should you run an A/B test? It&#8217;s recommended for 2 weeks, but <em><strong>why</strong></em>?&nbsp;&nbsp;</p></li><li><p>Can you (or should you) stop an A/B test early?&nbsp;</p></li><li><p>If you have to, what is the safest approach to handling fast A/B tests?&nbsp;&nbsp;</p></li><li><p>Why are slow rollouts dangerous?&nbsp;</p></li><li><p>What is the recommended procedure for gradually launching A/B tests over time?</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bz-f!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bc1db42-af3b-4c1d-9edd-95d38036634a_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bz-f!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bc1db42-af3b-4c1d-9edd-95d38036634a_200x200.png 424w, 
https://substackcdn.com/image/fetch/$s_!bz-f!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bc1db42-af3b-4c1d-9edd-95d38036634a_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!bz-f!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bc1db42-af3b-4c1d-9edd-95d38036634a_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!bz-f!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bc1db42-af3b-4c1d-9edd-95d38036634a_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bz-f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bc1db42-af3b-4c1d-9edd-95d38036634a_200x200.png" width="200" height="200" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5bc1db42-af3b-4c1d-9edd-95d38036634a_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:200,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bz-f!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bc1db42-af3b-4c1d-9edd-95d38036634a_200x200.png 424w, 
https://substackcdn.com/image/fetch/$s_!bz-f!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bc1db42-af3b-4c1d-9edd-95d38036634a_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!bz-f!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bc1db42-af3b-4c1d-9edd-95d38036634a_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!bz-f!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5bc1db42-af3b-4c1d-9edd-95d38036634a_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>There's a wealth of information on A/B testing available, ranging from academic papers on the frequentist approach (which is often less relevant for marketing or product analytics) to complex probability theories. Today, many product and analytics leaders may not have a background in statistics. Their knowledge of A/B testing often comes from online courses and self-study. As a result, they might not grasp the differences between A/B testing and Hypothesis testing or A/B testing and Split testing. This can lead to applying the same strategies to each or, even worse, having the same expectations for data trust in their results.</p><blockquote><p>&#8220;A Bayesian is one who, vaguely expecting a horse, and catching a glimpse of a donkey, strongly believes he has seen a mule.&#8221; <a href="https://twitter.com/robbalon">Dr. 
Rob Balon</a>.</p></blockquote><p>While I can't reinvent statistical methods to make Bayesian analysis fit the frequentist framework, or keep stakeholders from taking offense when analysts admit to having low trust in the data (breaking news: <a href="https://f.hubspotusercontent00.net/hubfs/215600/qubit-research-ab-test-results-are-illusory-1.pdf">most winning A/B test results are illusory</a>), I can at least reiterate and clarify the basics here.</p><h1>Why significance matters</h1><p>Because early experimental data is more likely to be wrong.&nbsp;</p><p>You shouldn&#8217;t stop the test early - even if you think you see a clear winner - because of regression to the mean.</p><p><a href="https://en.wikipedia.org/wiki/Regression_toward_the_mean">Regression to the mean</a> (RTM) describes the tendency of a variable that is extreme on its first measurement to move closer to the average on later measurements - in experimentation, it is a common source of false-positive results. In real life, RTM in a conversion metric looks approximately like this:&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rVXK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417135f9-7fc9-4a55-a109-078e4aa904d8_1600x455.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rVXK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417135f9-7fc9-4a55-a109-078e4aa904d8_1600x455.png 424w, https://substackcdn.com/image/fetch/$s_!rVXK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417135f9-7fc9-4a55-a109-078e4aa904d8_1600x455.png 848w, 
https://substackcdn.com/image/fetch/$s_!rVXK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417135f9-7fc9-4a55-a109-078e4aa904d8_1600x455.png 1272w, https://substackcdn.com/image/fetch/$s_!rVXK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417135f9-7fc9-4a55-a109-078e4aa904d8_1600x455.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rVXK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417135f9-7fc9-4a55-a109-078e4aa904d8_1600x455.png" width="1456" height="414" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/417135f9-7fc9-4a55-a109-078e4aa904d8_1600x455.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:414,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rVXK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417135f9-7fc9-4a55-a109-078e4aa904d8_1600x455.png 424w, https://substackcdn.com/image/fetch/$s_!rVXK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417135f9-7fc9-4a55-a109-078e4aa904d8_1600x455.png 848w, 
https://substackcdn.com/image/fetch/$s_!rVXK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417135f9-7fc9-4a55-a109-078e4aa904d8_1600x455.png 1272w, https://substackcdn.com/image/fetch/$s_!rVXK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F417135f9-7fc9-4a55-a109-078e4aa904d8_1600x455.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As you can see, the Variant conversion fluctuates significantly at first but then normalizes and begins getting closer to the 
mean.</p><p>In the examples above, Variant conversion changes a lot during the first 10-12 days and then stabilizes. The chart on the right is a more extreme example to illustrate how impactful and prolonged RTM can be. In my experience, I often see it stabilizing after around a week or so, eventually looking more like this:&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rT3e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff117869b-588c-48e4-bc76-ff7c4c22103e_1360x762.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rT3e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff117869b-588c-48e4-bc76-ff7c4c22103e_1360x762.png 424w, https://substackcdn.com/image/fetch/$s_!rT3e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff117869b-588c-48e4-bc76-ff7c4c22103e_1360x762.png 848w, https://substackcdn.com/image/fetch/$s_!rT3e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff117869b-588c-48e4-bc76-ff7c4c22103e_1360x762.png 1272w, https://substackcdn.com/image/fetch/$s_!rT3e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff117869b-588c-48e4-bc76-ff7c4c22103e_1360x762.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rT3e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff117869b-588c-48e4-bc76-ff7c4c22103e_1360x762.png" width="1360" height="762" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f117869b-588c-48e4-bc76-ff7c4c22103e_1360x762.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:762,&quot;width&quot;:1360,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rT3e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff117869b-588c-48e4-bc76-ff7c4c22103e_1360x762.png 424w, https://substackcdn.com/image/fetch/$s_!rT3e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff117869b-588c-48e4-bc76-ff7c4c22103e_1360x762.png 848w, https://substackcdn.com/image/fetch/$s_!rT3e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff117869b-588c-48e4-bc76-ff7c4c22103e_1360x762.png 1272w, https://substackcdn.com/image/fetch/$s_!rT3e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff117869b-588c-48e4-bc76-ff7c4c22103e_1360x762.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Depending on the daily sample volume, your test will show something else. Once you have completed a few tests, it becomes easier to identify the test stages relevant to only your product and tested traffic volume. 
These stages will guide you in estimating how long the test needs to run to reach significance and support decision-making, especially for &#8220;fast&#8221; tests.</p><p>For the example above, with Bayesian tests, I segment the test into 3 lifecycle stages:&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ve-y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9543c4f-8c06-4cd1-87c8-26fbb56f662d_1290x880.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ve-y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9543c4f-8c06-4cd1-87c8-26fbb56f662d_1290x880.png 424w, https://substackcdn.com/image/fetch/$s_!Ve-y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9543c4f-8c06-4cd1-87c8-26fbb56f662d_1290x880.png 848w, https://substackcdn.com/image/fetch/$s_!Ve-y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9543c4f-8c06-4cd1-87c8-26fbb56f662d_1290x880.png 1272w, https://substackcdn.com/image/fetch/$s_!Ve-y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9543c4f-8c06-4cd1-87c8-26fbb56f662d_1290x880.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ve-y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9543c4f-8c06-4cd1-87c8-26fbb56f662d_1290x880.png" width="1290" height="880" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c9543c4f-8c06-4cd1-87c8-26fbb56f662d_1290x880.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:880,&quot;width&quot;:1290,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ve-y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9543c4f-8c06-4cd1-87c8-26fbb56f662d_1290x880.png 424w, https://substackcdn.com/image/fetch/$s_!Ve-y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9543c4f-8c06-4cd1-87c8-26fbb56f662d_1290x880.png 848w, https://substackcdn.com/image/fetch/$s_!Ve-y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9543c4f-8c06-4cd1-87c8-26fbb56f662d_1290x880.png 1272w, https://substackcdn.com/image/fetch/$s_!Ve-y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc9543c4f-8c06-4cd1-87c8-26fbb56f662d_1290x880.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><ol><li><p>The first stage can be from a few hours to a few days and serves as a validation step. This is the time when you pick random users and check their attributes, distribution, tenure, etc.</p></li><li><p>The second stage is vital and is usually short (often just a day or two). It is a turning point for estimating significance. (It doesn&#8217;t mean the data is significant. But at this stage, for analysts, it&#8217;s usually easy to confirm how long the test will be running).</p></li><li><p>The third stage is the longest and the most stable. The test is getting confidence during these days. Your conversions are not expected to move significantly afterward.</p></li></ol><p>In your analysis, you treat these stages differently. Changing test traffic volume or setup during the first or the second stage is dangerous because you are likely to run into strong RTM. 
That being said, making test alterations during the third stage can be a compromise: you move faster, but at the cost of a lower degree of confidence.</p><p>Other definitions of RTM I run into are &#8220;random measurement error&#8221; or &#8220;non-systemic fluctuations around the true mean&#8221;. RTM is tricky because it makes the test data you receive look like a meaningful result (when it isn&#8217;t!). It becomes even more concerning when you accept it as a true test outcome and then run other tests against this &#8220;winning&#8221; group. Now you&#8217;ve ended up with multiple flawed tests (which is exactly what split tests do).</p><h1>Why does RTM occur?&nbsp;</h1><p>To simplify, you are likely to run into this effect when:</p><h3><strong>Your samples are not randomly distributed.</strong></h3><p>You are at the mercy of your experimentation instrumentation. If it doesn&#8217;t do a great job of randomly distributing users (which is often tricky to validate), there isn&#8217;t much you can realistically do except lessen your trust in your A/B tests.</p><h3><strong>Multiple rollout stages, sample size, or variance change.&nbsp;</strong></h3><p>These are the slow rollouts. Let&#8217;s say you launch Control and 2 Variants to 25% of traffic with a 34/33/33 split. After a few days, you remove Variant 2 and relaunch Control and 1 Variant with a 50/50 split to 50% of traffic. You are then far more likely to run into RTM.</p><h3><strong>Your target audience has too many properties.&nbsp;</strong></h3><p>The more attributes your target audience has, the higher the chance of RTM. For example, to qualify for the test, your users have to be in a particular country, use a specific language, be paid subscribers or complete at least 4 transactions, be eligible for a trial or a promo, have particular settings enabled on their profile, and so on. 
All of these are user attributes that qualify them to receive a test experience. The fewer attributes you apply, the cleaner (and faster) your test will go.&nbsp;&nbsp;</p><h1>How to prevent RTM?&nbsp;</h1><p>In truth, there is nothing you can do to eliminate the effect, but you can reduce the probability that it will affect your test. Here are some common protective practices:&nbsp;</p><ol><li><p>Before any experimentation, know your expected conversion (baseline). Ideally, measure it multiple times, know its range, and compare averages. This will help you identify whether your Variant conversion is way off or suspiciously high or low. Don&#8217;t trust Control as your true baseline.</p></li><li><p>Avoid running experiments on complex user groups that have many properties. The more attributes you introduce, the more complex the experiment becomes.</p></li><li><p>The test groups should be randomized and normally distributed (avoid mixing new and existing users).</p></li><li><p>Avoid slow and disproportionate rollouts. For example, once the test has been released to 10% of traffic, do not reduce its size.&nbsp;</p></li><li><p>Don&#8217;t stop the test earlier than planned, even if it appears you have reached significance or have a clear winner or loser.&nbsp;</p></li></ol><p>If you have to finish the test ASAP and don&#8217;t trust the results, you can run <a href="https://www.statisticshowto.com/ancova/">an analysis of covariance (ANCOVA)</a>, which combines ANOVA with regression. 
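</p><p>The heart of ANCOVA is adjusting each group&#8217;s outcome for a pre-experiment covariate before comparing group means. Here is a minimal pure-Python sketch of that adjustment (toy numbers; in practice you would reach for a statistics package):</p>

```python
from statistics import mean

def _sxy(x, y):
    """Sum of cross-products about the means."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

def ancova_adjusted_diff(control, variant):
    """Compare group means after regressing out a pre-period covariate.

    Each group is a list of (pre_metric, post_metric) pairs.
    """
    pre_c, post_c = zip(*control)
    pre_v, post_v = zip(*variant)
    # Pooled within-group slope of post on pre.
    b = (_sxy(pre_c, post_c) + _sxy(pre_v, post_v)) / (
        _sxy(pre_c, pre_c) + _sxy(pre_v, pre_v))
    grand_pre = mean(pre_c + pre_v)
    adj_c = mean(post_c) - b * (mean(pre_c) - grand_pre)
    adj_v = mean(post_v) - b * (mean(pre_v) - grand_pre)
    return adj_v - adj_c

# Toy data: post = 2 * pre, plus a +0.5 treatment effect in the variant.
control = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
variant = [(1, 2.5), (2, 4.5), (3, 6.5), (4, 8.5), (5, 10.5)]
print(ancova_adjusted_diff(control, variant))  # -> 0.5
```

<p>Because the toy variant&#8217;s post metric is exactly 2 * pre plus 0.5, the covariate adjustment recovers the 0.5 treatment effect.</p><p>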
It used to be easy to run in SPSS, SAS, Stata, and other old-world statistical packages, but now you can also run it in <a href="https://www.statology.org/ancova-python/">Python with pandas</a>.</p><p>If you don&#8217;t have the resources or time to run ANCOVA, here is my &#8220;rule of thumb,&#8221; which you&#8217;re welcome to borrow (with a degree of caution):</p><p><a href="https://emojipedia.org/large-green-circle/">&#128994;</a> If variant conversion doesn&#8217;t fluctuate and stays stable throughout the vital stages of the test, it&#8217;s likely trustworthy. You can end the test early, expand more traffic to it, and move on.</p><p><a href="https://emojipedia.org/large-yellow-circle/">&#128993;</a> If variant conversion fluctuates during the early stage but then stabilizes, you are likely dealing with samples that are not normalized or randomly distributed. If this happens often, your experimentation instrumentation is not doing its job. You can&#8217;t go fast with these tests. You have to run them long enough to get significant data over a prolonged period. Such tests need lots of validations and checks.&nbsp;</p><p>&#128308; If variant conversion fluctuates during all stages of the test, and significance is reached with only a small difference from Control, you are likely dealing with an inconclusive test. Disregard its result and relaunch with a new audience or approach.&nbsp;</p><p>Hopefully, the early pattern you see holds or improves and becomes more pronounced throughout the test. 
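</p><p>One way to make this rule of thumb mechanical is to compare how volatile variant conversion is early versus late in the test. This is only an illustrative sketch - the half/half split and the 20%-of-mean noise threshold are arbitrary assumptions, not standards:</p>

```python
from statistics import mean, pstdev

def stability_flag(daily_conversion, noise_ratio=0.20):
    """Classify a test by how much the variant's daily conversion fluctuates.

    Splits the series in half and calls a half "noisy" when its standard
    deviation exceeds noise_ratio * overall mean (arbitrary threshold).
    """
    half = len(daily_conversion) // 2
    threshold = noise_ratio * mean(daily_conversion)
    early_noisy = pstdev(daily_conversion[:half]) > threshold
    late_noisy = pstdev(daily_conversion[half:]) > threshold
    if not early_noisy and not late_noisy:
        return "green"   # stable throughout: likely trustworthy
    if early_noisy and not late_noisy:
        return "yellow"  # fluctuates early, then stabilizes: run longer
    return "red"         # still fluctuating late in the test: likely inconclusive

print(stability_flag([0.10] * 14))                                              # green
print(stability_flag([0.05, 0.15, 0.06, 0.14, 0.05, 0.15, 0.06] + [0.10] * 7))  # yellow
print(stability_flag([0.05, 0.15] * 7))                                         # red
```

<p>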
Just don&#8217;t be careless with statistics and follow all the necessary steps and practices to ensure the testing effect you see doesn&#8217;t happen by chance (and remember, there is always a probability it does).</p><h3>Learn more on RTM and test setup:</h3><ul><li><p><a href="https://www.makethebrainhappy.com/2018/12/regression-to-the-mean.html">MakeTheBrainHappy - Regression To The Mean</a></p></li><li><p><a href="https://fs.blog/regression-to-the-mean/">Regression Toward the Mean: An Introduction with Examples</a></p></li><li><p><a href="https://conversion-uplift.co.uk/post/why-are-most-ab-test-results-a-lie/">Why Are Most A/B Test Results A Lie?</a></p></li><li><p><a href="https://f.hubspotusercontent00.net/hubfs/215600/qubit-research-ab-test-results-are-illusory-1.pdf">Most Winning A/B Test Results are Illusory</a></p></li><li><p><a href="https://www.growth-catalyst.in/p/what-goes-wrong-in-ab-tests">What Goes Wrong in A/B Tests</a></p></li><li><p><a href="https://www.evanmiller.org/how-not-to-run-an-ab-test.html">How Not To Run an A/B Test</a></p></li></ul><p>Thanks for reading, everyone. 
Until next Wednesday!</p>]]></content:encoded></item><item><title><![CDATA[How To Run An A/B Testing On Low Traffic - Issue 181]]></title><description><![CDATA[Strategy and solutions for effective A/B Tests with small samples]]></description><link>https://dataanalysis.substack.com/p/how-to-run-an-ab-testing-on-low-traffic</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/how-to-run-an-ab-testing-on-low-traffic</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 10 Jan 2024 13:02:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tRG5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d64f663-d16b-4ef8-9125-7891b26fff21_1600x547.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <a href="https://dataanalysis.substack.com/">Data Analysis Journal</a>, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p><em>Before we dive in, I wanted to ask a favor to fill out <a href="https://s.surveyplanet.com/hdui562w">this brief survey</a> (And thank you so much to those who have already done it! You are the best!</em> &#10024;<em>). As my audience grows, I&#8217;d like to better understand how my journal can stay relevant and bring more value.&nbsp;</em></p><div><hr></div><p>The most common question I have received from my readers has been how to do A/B testing with very low traffic.</p><p>There are no specific rules or requirements dictating when to start A/B testing or the minimum number of users necessary to launch a test. 
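</p><p>One knob low-traffic teams often reach for is the confidence level, so it helps to see roughly how much sample a lower bar buys you. Here is a sketch using the standard two-proportion sample-size formula (the 5% baseline and 1-percentage-point lift are placeholder assumptions):</p>

```python
from math import ceil
from statistics import NormalDist

def sample_size(baseline, lift, confidence, power=0.80):
    """Per-variant sample size needed to detect `lift` over `baseline`."""
    p1, p2 = baseline, baseline + lift
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return ceil((z_alpha + z_beta) ** 2
                * (p1 * (1 - p1) + p2 * (1 - p2)) / lift ** 2)

# Placeholder inputs: 5% baseline conversion, +1pp detectable lift.
n_95 = sample_size(0.05, 0.01, confidence=0.95)
n_85 = sample_size(0.05, 0.01, confidence=0.85)
print(n_95, n_85, f"saving {1 - n_85 / n_95:.0%}")
```

<p>With these inputs, dropping from 95% to 85% confidence cuts the required sample by roughly a third - the trade-off, of course, is a higher false-positive risk.</p><p>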
There are various opinions on how to approach experimentation with small sample sizes, ranging from more conservative &#8220;<em>You should not run an A/B test until you reach 30,000 users per variant period</em>&#8221; to the more progressive &#8220;<em>It&#8217;s fine, just adjust your thresholds</em>.&#8221;</p><p>Today, I will share my guide on the procedure, specifics, and caveats of experimenting with low traffic. I will cover topics such as the minimum number of users needed to launch an A/B test and offer suggestions on increasing confidence and trust in small sample tests.&nbsp;</p><p>I&#8217;ll present it in a Q&amp;A format to ensure I address all the questions I received from readers regarding low-traffic tests (I apologize for the delay in addressing this):</p><ol><li><p>What is the minimum number of users needed in a sample size for an A/B test?</p></li><li><p>Can we run a test on a very small sample (~50 users in Variant per day) and wait longer to reach significance?</p></li><li><p>What other factors should we consider to ensure we make the right decision?</p></li><li><p>How do you figure out the trade-off between confidence and test timeline? For example, can we test small samples with an 85% confidence instead of 95%? 
Will the test run faster?</p></li><li><p>What are some rules or testing limitations in low traffic to address to increase statistical rigor?</p></li></ol><h5><em>If you need a refresher on the A/B testing concept or procedure, read the <a href="https://dataanalysis.substack.com/p/ab-test-checklist">A/B Test Checklist</a>.</em></h5><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3K0V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe865542-8911-4545-969f-4d26a36e0002_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3K0V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe865542-8911-4545-969f-4d26a36e0002_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!3K0V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe865542-8911-4545-969f-4d26a36e0002_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!3K0V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe865542-8911-4545-969f-4d26a36e0002_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!3K0V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe865542-8911-4545-969f-4d26a36e0002_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3K0V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe865542-8911-4545-969f-4d26a36e0002_200x200.png" width="200" height="200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe865542-8911-4545-969f-4d26a36e0002_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3K0V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe865542-8911-4545-969f-4d26a36e0002_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!3K0V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe865542-8911-4545-969f-4d26a36e0002_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!3K0V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe865542-8911-4545-969f-4d26a36e0002_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!3K0V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe865542-8911-4545-969f-4d26a36e0002_200x200.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="pullquote"><p>&#8220;When running online experiments, getting numbers is easy; getting numbers you can trust is hard.&#8221; - <a href="https://www.exp-platform.com/Documents/IEEE2010ExP.pdf">Online Experiments: Practical Lessons, Microsoft</a>.</p></div><blockquote><h3><strong>Q1: Is there a set rule for the minimum number of users required to launch an A/B 
test?</strong></h3></blockquote>
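<p>As a rough illustration of the math behind these questions, here is a minimal power-analysis sketch in Python (standard library only). The 10% baseline, the 12% target, and the ~50 users per variant per day are hypothetical numbers for illustration, not recommendations:</p>

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_variant(p_base, p_variant, alpha=0.05, power=0.8):
    """Approximate users per variant for a two-sided, two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power (1 - beta)
    p_bar = (p_base + p_variant) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_base * (1 - p_base)
                                 + p_variant * (1 - p_variant))) ** 2
    return ceil(numerator / (p_base - p_variant) ** 2)

# Hypothetical example: 10% baseline conversion, hoping to detect a lift to 12%
n_95 = sample_size_per_variant(0.10, 0.12, alpha=0.05)  # 95% confidence
n_85 = sample_size_per_variant(0.10, 0.12, alpha=0.15)  # relaxed to 85%

# At ~50 users per variant per day, the required runtime in days:
days_95 = ceil(n_95 / 50)
days_85 = ceil(n_85 / 50)
print(n_95, days_95)  # several thousand users per variant, months of runtime
print(n_85, days_85)  # lowering confidence shortens the test, at higher risk
```

<p>Lowering confidence from 95% to 85% does make the test finish faster, but only because you accept a three-times-higher chance of a false positive.</p>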
      <p>
          <a href="https://dataanalysis.substack.com/p/how-to-run-an-ab-testing-on-low-traffic">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[An Analysis Of Bias Or Why A/B Testing Fails - Issue 172]]></title><description><![CDATA[A recap of Stanford and Airbnb's collaborative paper on experimentation setup and analysis in two-sided platforms and marketplaces.]]></description><link>https://dataanalysis.substack.com/p/an-analysis-of-bias-or-why-ab-testing</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/an-analysis-of-bias-or-why-ab-testing</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 15 Nov 2023 13:00:57 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!8TcY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6feedf14-c725-4a87-a09e-da74e39e9fbf_1500x1036.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A few years ago, Stanford collaborated with Airbnb on a fascinating research study - <em><strong><a href="https://arxiv.org/pdf/2002.05670.pdf">Experimental Design in Two-Sided Platforms: An Analysis of Bias</a></strong></em>. Last month, this paper was recognized with the <a href="https://www.informs.org/Recognizing-Excellence/INFORMS-Prizes/George-B.-Dantzig-Dissertation-Award">Dantzig Dissertation Award</a> and the <a href="https://www.informs.org/Recognizing-Excellence/Community-Prizes/Manufacturing-and-Service-Operations-Management/MSOM-Service-Management-SIG-Best-Paper-Award">MSOM Service Management SIG Best Paper Award</a>.</p><p>It&#8217;s a study that should be close to the heart of any data scientist who works with A/B testing and who has, at some point, given up explaining why rapid iteration may not work and why A/B testing often fails.</p><p>The study offers a new way of setting up A/B tests and a new perspective on experimentation. Today, I am sharing a high-level recap of this research and showing how to integrate its theory and findings into our daily work. 
I&#8217;ll review the typical A/B test setups in marketplaces, how they fall short, and what methods the new research offers to address the bias.</p><p>I also want to emphasize how crucial it is for analysts to understand the complexity of interference, how dangerous bias is for the test analysis, and how we can continue to develop ways to navigate communication and uncertainty.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jxVw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9fe3a79-3f36-46c1-ba52-0282aa7270aa_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jxVw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9fe3a79-3f36-46c1-ba52-0282aa7270aa_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!jxVw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9fe3a79-3f36-46c1-ba52-0282aa7270aa_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!jxVw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9fe3a79-3f36-46c1-ba52-0282aa7270aa_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!jxVw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9fe3a79-3f36-46c1-ba52-0282aa7270aa_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jxVw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9fe3a79-3f36-46c1-ba52-0282aa7270aa_200x200.png" width="200" height="200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a9fe3a79-3f36-46c1-ba52-0282aa7270aa_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jxVw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9fe3a79-3f36-46c1-ba52-0282aa7270aa_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!jxVw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9fe3a79-3f36-46c1-ba52-0282aa7270aa_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!jxVw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9fe3a79-3f36-46c1-ba52-0282aa7270aa_200x200.png 1272w, https://substackcdn.com/image/fetch/$s_!jxVw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa9fe3a79-3f36-46c1-ba52-0282aa7270aa_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>A/B testing is one of the methods of proving causation. 
If you want to prove A is causing B, you need to do two things: A/B test it, and then A/B test it <em>right</em>.</p><p>A recap from <a href="https://dataanalysis.substack.com/p/how-to-prove-causation-issue-148">How To Prove Causation</a>:</p><blockquote><p>Causal inference methods include <strong>hypothesis testing</strong> (experimentation) and <strong>observations</strong> (user research). Keep in mind that experimentation (A/B tests, multivariate tests, split tests, etc.) and user research (surveys, field studies, interviews, etc.) don&#8217;t guarantee you clean causality; it&#8217;s subject to randomization, significance, confidence levels, treatment group size, setup, and more.</p></blockquote><p>If you have been reading my newsletter, you should know that the success of your test depends on: </p><ul><li><p>Instrumentation</p></li><li><p>Data maturity</p></li><li><p>A process tailored to reflect the nature of your product and business </p></li></ul><p>(read more: <a href="https://dataanalysis.substack.com/p/embracing-the-new-era-of-accelerated">Embracing the New Era of Accelerated Testing</a>)</p><p>This academic study reiterates my points about the importance of test instrumentation. The test design should appropriately address your unique product offering, customer, and market. <em><strong>If you can&#8217;t validate or prove the test setup, or if you question the ability of your instrumentation to allocate users accurately, your test analysis might be wrong and lead to incorrect assessments.</strong></em></p><p>Another reason this particular research stands out is that <a href="https://profiles.stanford.edu/ramesh-johari">Ramesh Johari</a> himself led it. Ramesh Johari is a renowned scientist, researcher, and professor who has spent almost 20 years studying experimentation and has been advising Optimizely, Stitch Fix, Upwork, Airbnb, Uber, Bumble, Stripe, and more. 
</p><h4>&#128161; Make sure to listen to his recent interview in <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Lenny's Newsletter&quot;,&quot;id&quot;:10845,&quot;type&quot;:&quot;pub&quot;,&quot;url&quot;:&quot;https://open.substack.com/pub/lenny&quot;,&quot;photo_url&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/40b73d36-7447-4f52-83b5-4bbf44b9324a_1200x1200.png&quot;,&quot;uuid&quot;:&quot;eaf87abf-126f-4b2e-b671-53ae9b1864ff&quot;}" data-component-name="MentionToDOM"></span> - <a href="https://www.lennyspodcast.com/marketplace-lessons-from-uber-airbnb-bumble-and-more-ramesh-johari-stanford-professor-startup/">Marketplace lessons from Uber, Airbnb, Bumble, and more</a>:</h4><blockquote><p>&#8220;Many of the changes that are most consequential create winners and losers. And rolling with those changes is about recognizing whether the winners you've created are more important to your business than the losers you've created in the process.&#8220;</p></blockquote><h2>What this research is about:</h2><p>This 56-page academic study offers a framework to better understand how to set up and analyze experiments in <strong>two-sided marketplaces</strong> - products that offer interactions between different user groups (like buyers and sellers, readers and writers, listings and customers). The study is relevant to any platform with a &#8220;dynamic inventory model&#8221; - bookings, services, content, or tutorials, such as Upwork, Udemy, Airbnb, Substack, Uber, eBay, Etsy, Medium, Twitter, Instagram, Meta, and more.</p><h2>The challenge: there is no easy way to eliminate interference.</h2><p>This study focuses on the <strong>interference effect</strong>, where a treatment applied to one group of users unintentionally impacts another group. This phenomenon leads to bias and incorrect estimations, preventing an accurate test read.</p><h3>How tests are run today in marketplaces.</h3>
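<p>The interference effect can be made concrete with a toy simulation (my own sketch, not from the paper, and all numbers below are made up). When treatment and control users book from the same limited inventory, a naive Control-vs-Variant comparison shows a large lift even though rolling the treatment out to everyone would barely move bookings at all:</p>

```python
import random

def simulate(treat_share, n_users=20_000, inventory=6_000,
             p_control=0.30, p_treat=0.45, seed=42):
    """Users arrive one by one and book from a shared, limited inventory."""
    rng = random.Random(seed)
    booked = {"control": 0, "treat": 0}
    exposed = {"control": 0, "treat": 0}
    stock = inventory
    for _ in range(n_users):
        arm = "treat" if rng.random() < treat_share else "control"
        exposed[arm] += 1
        p = p_treat if arm == "treat" else p_control
        if stock > 0 and rng.random() < p:  # each booking consumes shared stock
            booked[arm] += 1
            stock -= 1
    return booked, exposed

# Naive 50/50 A/B test: treatment users take inventory control would have booked
b, e = simulate(0.5)
naive_lift = b["treat"] / e["treat"] - b["control"] / e["control"]

# "Global" effect: an all-treatment world compared with an all-control world
bt, et = simulate(1.0)
bc, ec = simulate(0.0)
global_lift = bt["treat"] / et["treat"] - bc["control"] / ec["control"]

print(f"naive lift:  {naive_lift:.3f}")   # looks like a clear win
print(f"global lift: {global_lift:.3f}")  # near zero: demand merely shifted
```

<p>The naive estimate is biased upward because the treatment cannibalizes control's bookings - the kind of bias the paper analyzes and proposes experimental designs to correct.</p>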
      <p>
          <a href="https://dataanalysis.substack.com/p/an-analysis-of-bias-or-why-ab-testing">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Embracing the New Era of Accelerated Testing - Issue 150]]></title><description><![CDATA[How to adapt your data science and analytics teams to support the ever-changing framework of A/B tests successfully.]]></description><link>https://dataanalysis.substack.com/p/embracing-the-new-era-of-accelerated</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/embracing-the-new-era-of-accelerated</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 28 Jun 2023 12:01:01 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SyPw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dabe88-289d-41e6-8c3e-b99f3a5c2ede_1522x890.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hello, and welcome to my<a href="https://dataanalysis.substack.com/"> Data Analytics Journal</a>, where I write about data science and product analytics.</p><p>This month paid subscribers learned about:</p><ul><li><p><a href="https://dataanalysis.substack.com/p/decoding-regression-scores-issue">Decoding Regression Scores</a> - Linear Regression Part 2: How to read regression plots and interpret the equation that drives analysis and predictions.</p></li><li><p><a href="https://dataanalysis.substack.com/p/how-to-prove-causation-issue-148">How To Prove Causation</a> - If correlation doesn&#8217;t imply causation, then what does? Examples of causation analysis, how to use regression for decision-making, and how to recognize when the regression pattern you see is correct and trusted.</p></li><li><p><a href="https://dataanalysis.substack.com/p/a-deep-dive-into-user-engagement">A Deep Dive Into User Engagement Through Tricky Averages</a> - A reminder of how averages can be misleading. 
Math and steps to report the &#8220;average per user per day&#8221; KPI to measure the depth of user engagement.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p></li></ul><p>A few weeks ago, I went to the first reopened <a href="https://lyftwimldslightningtalks2023.splashthat.com/">DS&amp;ML meetup series organized by WiMLDS and Lyft</a>. One of the talks delivered there, <em>Demonstrating Leadership through A/B Testing</em>, stuck with me and got me thinking. And thinking.</p><p>The talk was great. It was essentially a textbook A/B test methodology laid out in a concise, structured way, illustrating the foundational aspects with a solid case study shared by the <a href="https://www.pandora.com/">Pandora</a> product analytics team. And yet, when I tried to map the described A/B testing framework onto our current development workflow, it was clear it wouldn&#8217;t set my team up for success.&nbsp;</p><p>The modern, boosted &#8220;optimization&#8221; trend makes me question its methods and wonder whether the <strong>current school of experimentation is actually set up for it</strong>. It seems to me that the way of doing hypothesis testing we learned back in statistics class doesn&#8217;t serve us well anymore. 
In order to succeed, we have to make questionable compromises, break some boundaries, and bend backward to make it all work.&nbsp;</p><p>So today I&#8217;ll talk about the dissonance between tempting &#8220;optimize your app in no time and effort using our tool to grow 10x faster&#8221; and being data-driven in its true meaning - making sound decisions based on facts and proven insights that you can trust and validate.&nbsp;</p><p>And the biggest question I have on my plate is how do you set the data science and analytics team up for success to make sure you <strong>empower this new accelerated &#8220;optimization&#8221; culture while keeping the confidence and accuracy of the insights you share</strong>?&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!smvn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff71b52e3-768b-4b7a-a27f-8239f692c04d_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!smvn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff71b52e3-768b-4b7a-a27f-8239f692c04d_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!smvn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff71b52e3-768b-4b7a-a27f-8239f692c04d_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!smvn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff71b52e3-768b-4b7a-a27f-8239f692c04d_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!smvn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff71b52e3-768b-4b7a-a27f-8239f692c04d_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!smvn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff71b52e3-768b-4b7a-a27f-8239f692c04d_200x200.png" width="190" height="190" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f71b52e3-768b-4b7a-a27f-8239f692c04d_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:190,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!smvn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff71b52e3-768b-4b7a-a27f-8239f692c04d_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!smvn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff71b52e3-768b-4b7a-a27f-8239f692c04d_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!smvn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff71b52e3-768b-4b7a-a27f-8239f692c04d_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!smvn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff71b52e3-768b-4b7a-a27f-8239f692c04d_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p><a href="https://dataanalysis.substack.com/p/the-new-age-of-ab-tests-issue-99">As I mentioned before</a>, A/B testing is often the culprit of tension between analysts and product owners. After reading success stories from <a href="https://developers.facebook.com/videos/f8-2017/how-we-shipped-reactions/">Meta (How We Shipped Reactions)</a>, <a href="https://engineering.linkedin.com/blog/2019/06/detecting-interference--an-a-b-test-of-a-b-tests">LinkedIn (Detecting interference: An A/B test of A/B tests)</a>, or <a href="https://medium.com/airbnb-engineering/experiments-at-airbnb-e2db3abf39e7">Airbnb (Experiments at Airbnb)</a>, many product leaders are inspired to follow the trend of constantly iterating with user flows, copies, CTAs positioning, layouts, etc. Then analysts come in and question the test objective, setup, baseline metrics, and experimentation toolkit. Quite often they have to push back on the test launch, delay the test results readout, or even disregard them. They might appear to hinder progress when they&#8217;re really just trying to do their job.</p><p><strong>Your ability to test is linearly correlated with how well your analytics is set up. 
This comes down to (A) the testing instrumentation, (B) analytics maturity, and (C) test procedure and protocols.</strong></p><p>Now, let&#8217;s imagine you have:&nbsp;</p><ol><li><p>The instrumentation of tomorrow that &#8220;streamlines your feature flagging&#8221; and allows you to &#8220;build and deploy in 10 minutes flat&#8221; and &#8220;easily control multiple flags at once&#8221; (<a href="https://www.kameleoon.com">Kameleoon</a>, <a href="https://superwall.com/">Superwall</a>, <a href="https://launchdarkly.com/">LaunchDarkly</a>, <a href="https://devcycle.com/">DevCycle</a> to name a few).&nbsp;</p></li><li><p>On top of it, let&#8217;s pretend you work at some mystical and magical workplace with a solid analytical foundation set, maintained by dedicated full-time data governance champions. You have your events, definitions, data attributes, and schemas in order (don&#8217;t believe in tooling for this).&nbsp;</p></li></ol><p>Then what? The success of experimentation will come down to the same old standard:&nbsp;</p><ol start="3"><li><p><strong>Test procedure and protocols.</strong></p></li></ol><p>But here&#8217;s the thing. The current &#8220;academic&#8221; workflow we have been taught to follow (and by which I <a href="https://dataanalysis.substack.com/p/ab-test-checklist">stand by</a>) ironically will fail you once you have a solid instrumentation and dream data&amp;analytics governance. And here&#8217;s why.</p><h1><strong>A/B tests become a never-ending lifecycle of quick optimizations</strong></h1><p>Testing today, especially on mobile, evolves into a never-ending series of short A/B tests. There is no beginning, no end of a product feature iteration. It's an eternal evolving spiraling cycle that we may never escape from. And this is a new thing aggressively boosted with modern tooling for app development. 
</p><p>From <strong><a href="https://superwall.com/">Superwall</a></strong>:</p><blockquote><p>&#8220;We actually allow you to iterate an experiment in flight. Let's say you have a (A: 50) (B: 50) setup and B starts winning, you can maintain the existing users in their groups and either (1) Start assigning more users into B to minimize the opportunity cost of the test, while slowly marching to stat sig or (2) Introduce C, while still assigning users to (A:25) (B:25) (C:50) to quickly iterate. It's not a statistically perfect method but allows for directionally correct iteration w/o having to wait for the experiment to complete. It's something some of our most successful teams do.&#8221;</p></blockquote><p>Our previous academic approach to data handling and evaluation for a test simply doesn't work anymore. <strong>Most analytics teams are not equipped or skilled for this</strong>. We were taught to treat every A/B test as a project - create a ticket for it, assign an id for the experiment, estimate the timeline to reach significance, document a case study for each test, have the analysis ready, checks, 3-month follow-ups, <a href="https://dataanalysis.substack.com/p/playbook-for-launching-monitoring">you know the story</a>.&nbsp;</p><p>This &#8220;old-school&#8221; approach holds you back, it blocks teams, and it's not scalable or efficient.&nbsp;</p><p>I am astonished by how today's modern tooling accelerates optimizations. Today&#8217;s experimentation, enabled by <a href="https://superwall.com/">Superwall</a>, <a href="https://qonversion.io/">Qonversion</a>, <a href="https://www.split.io/">Split</a>, and others, sets the bar very high. Just think about it: a given user is put into 5-10 different tests or &#8220;experiences&#8221; simultaneously, which are connected, overlap, or worst of all: <em>dependent on each other</em>. 
For example, you can test the onboarding flow, paywall conversions (which are connected to the onboarding flow), and accessibility of premium features (which are dependent on paywall conversion) for one user concurrently. It's <em>growth-fascinating</em> and <em>statistically-terrifying</em> at the same time.</p><p>Before, it was "an error" not to exclude users in your test analysis who got in various tests at the same time. Now, quite often you are forced to keep them to make it significant (mind you, most of the tests have slow rollouts and might be running at small % traffic on one mobile platform). You have to have some type of bias measure - how harmful, how polluted users are. It's not Control vs Variant anymore, it's Test_1 Control vs Test_2 Variant_A vs Test_3 Variant_D and possibly a branchy tree downstream of small samples. Teams don't realize they work with complex multi-variable tests, and winner Variant_B from Test_1 might not be "compatible" with a winner Variant_A from Test_2:&nbsp;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SyPw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dabe88-289d-41e6-8c3e-b99f3a5c2ede_1522x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SyPw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dabe88-289d-41e6-8c3e-b99f3a5c2ede_1522x890.png 424w, https://substackcdn.com/image/fetch/$s_!SyPw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dabe88-289d-41e6-8c3e-b99f3a5c2ede_1522x890.png 848w, 
https://substackcdn.com/image/fetch/$s_!SyPw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dabe88-289d-41e6-8c3e-b99f3a5c2ede_1522x890.png 1272w, https://substackcdn.com/image/fetch/$s_!SyPw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dabe88-289d-41e6-8c3e-b99f3a5c2ede_1522x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SyPw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dabe88-289d-41e6-8c3e-b99f3a5c2ede_1522x890.png" width="612" height="357.70054945054943" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e8dabe88-289d-41e6-8c3e-b99f3a5c2ede_1522x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:851,&quot;width&quot;:1456,&quot;resizeWidth&quot;:612,&quot;bytes&quot;:301156,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!SyPw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dabe88-289d-41e6-8c3e-b99f3a5c2ede_1522x890.png 424w, https://substackcdn.com/image/fetch/$s_!SyPw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dabe88-289d-41e6-8c3e-b99f3a5c2ede_1522x890.png 848w, 
https://substackcdn.com/image/fetch/$s_!SyPw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dabe88-289d-41e6-8c3e-b99f3a5c2ede_1522x890.png 1272w, https://substackcdn.com/image/fetch/$s_!SyPw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8dabe88-289d-41e6-8c3e-b99f3a5c2ede_1522x890.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em><strong>&#8220;Simpson&#8217;s Paradox:</strong></em></p><p><em>A trend or result that is present when data is put into groups 
that reverses or disappears when the data is combined.&#8221; Read more about <a href="https://towardsdatascience.com/simpsons-paradox-and-interpreting-data-6a0443516765">Simpson&#8217;s Paradox and Interpreting Data</a>.</em></p><p>&#128681; To reiterate, within a single time frame, you won't be able to confirm for multiple tests that the lift they demonstrate (a) didn&#8217;t happen by chance and (b) is compatible with all the iterations actively tested.&nbsp;</p><p>I don&#8217;t intend to block or hinder experimentation. It&#8217;s exciting what the latest feature flagging and experimentation tools can do. But at the end of the day, it&#8217;s my team&#8217;s responsibility to ensure trust, confidence, and correctness in the test inference. Here are some of the things I am trying out, or that have proven to back me up.&nbsp;</p><h2>&#9989; <strong>How to adapt your team to the thriving era of experimentation</strong></h2><ol><li><p><strong>Pre/Post analysis</strong>: always keep a check on high-level KPIs. Regardless of what test impact you see, the ecosystem metrics (usually reported monthly) don&#8217;t lie. Set up reminders and follow up in 30 days, or 3 months, and keep it high level. If you can't do it for every test, do it for a series of product iterations - the onboarding flow, a new experience for a new feature, the initial activity, etc.</p></li><li><p><strong>Assign an owner for a product feature (with its baseline metrics):</strong> A few years ago, I&#8217;d take the total volume of active A/B tests and divide it equally among my analysts, with more complex tests (like new pricing, product adoption, notification frequency) assigned to senior analysts and more straightforward ones (onboarding, upsell, UI change tests) to junior analysts. Today, however, I am leaning towards solely dedicating an analyst to a product feature or experience (e.g. 
onboarding, activation, conversion, engagement, subscriptions) and letting them own their domain through whatever iteration the product team is testing, regardless of its level of complexity.</p></li><li><p><strong>Automate statistical checks</strong>: test validations (variance, distribution, randomness) have to be automated today. Assign a dedicated DS person to oversee statistics across all tests, and let them own and automate it (via views or dashboarding). For example, back in my time at <a href="https://vidiq.com/">VidIQ</a>, we introduced a consolidated experimentation dashboard that returned a table with the z-score, p-value, variance, volume, and user distribution for every active test. It was such a time saver. Don&#8217;t do manual checks for every test in your impact analysis; it&#8217;s not scalable.&nbsp;</p></li><li><p><strong>Streamline documentation</strong>: your team has to simplify the test evaluation readout, which includes both communication and documentation. Readjust your expectations for the readout: for example, a slide per test instead of a deck per test. With such a volume of iterations, detailed notebooks and deep-dive impact analyses per test are no longer realistically doable. We passed that stage once modern SaaS enabled PMs to make production changes in one click, without coding or deploying.&nbsp;</p><p>I live in the past and thus still attempt to document every product initiative, big or small, to understand how metrics move and user behavior evolves. Moving forward, I am leaning towards documenting only big changes that aim to disrupt user flow or behavior.&nbsp;</p></li><li><p><strong>Eliminate the bias:</strong> At this very moment, I am brainstorming with my fellow analysts about a way to quantify the bias from users getting into multiple tests.
For example, create a new attribute of a cumulative number of currently active tests per user across revenue, onboarding, activation, core usage, secondary features, etc. I&#8217;m envisioning something like this:&nbsp;</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!alcN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96542cc3-7685-428c-bc5d-d72bfae3886c_1600x246.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!alcN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96542cc3-7685-428c-bc5d-d72bfae3886c_1600x246.png 424w, https://substackcdn.com/image/fetch/$s_!alcN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96542cc3-7685-428c-bc5d-d72bfae3886c_1600x246.png 848w, https://substackcdn.com/image/fetch/$s_!alcN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96542cc3-7685-428c-bc5d-d72bfae3886c_1600x246.png 1272w, https://substackcdn.com/image/fetch/$s_!alcN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96542cc3-7685-428c-bc5d-d72bfae3886c_1600x246.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!alcN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96542cc3-7685-428c-bc5d-d72bfae3886c_1600x246.png" width="1456" height="224" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96542cc3-7685-428c-bc5d-d72bfae3886c_1600x246.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:224,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!alcN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96542cc3-7685-428c-bc5d-d72bfae3886c_1600x246.png 424w, https://substackcdn.com/image/fetch/$s_!alcN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96542cc3-7685-428c-bc5d-d72bfae3886c_1600x246.png 848w, https://substackcdn.com/image/fetch/$s_!alcN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96542cc3-7685-428c-bc5d-d72bfae3886c_1600x246.png 1272w, https://substackcdn.com/image/fetch/$s_!alcN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96542cc3-7685-428c-bc5d-d72bfae3886c_1600x246.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>If we had this table, we could cherry-pick users with the fewest conflicting test overlaps for the test readout. Then, maybe, we could wrap it into a downstream view with something like a test_overlap_bias_score between 1 and 10.
This way, we could use users with scores of 1-2 for high-impact test analysis behind important decision-making, and then compare them to users with scores of 5-7. Again, just brainstorming here, but this is entirely doable for companies with mature analytics.&nbsp;</p><p>Or we could try this solution shared by <strong><a href="https://www.geteppo.com/">Eppo</a></strong>:</p><blockquote><p>&#8220;Another method can be to run a regression every day on a user-level dataset predicting a core metric. In that regression, include a covariate for every active experiment and also every <strong>pair</strong> of experiments. If the pair-wise experiment coefficients show up as significant to some threshold, you'd see that there's an interaction effect issue.&#8221;</p></blockquote><h2><strong>Best practices shared by experimentation experts:</strong></h2><p><strong><a href="https://superwall.com/">Superwall</a>:&nbsp;</strong></p><blockquote><p>&#8220;When using Superwall we'll automatically report back to you the inclusion of a user in an experiment. People usually then set that as an attribute on their user model to be able to break down by all the permutations of test variants.&#8221;</p></blockquote><p><strong><a href="https://qonversion.io/">Qonversion</a>:</strong></p><blockquote><p>&#8220;There are a few different ways that help you avoid data contamination and optimize traffic for a large number of simultaneous experiments:</p><p>- Run mutually exclusive experiments. Usually, if experiments are located in the same app area and aimed at affecting the same metric, it&#8217;s highly recommended not to expose them to the same set of users. Such an approach does require having an advanced splitter at your disposal. However, eventually, you get crystal-clear data. For example, do not test onboarding changes and a journey to the Aha! moment on the same users.</p><p>- Expose your user to the experiment only once they reach a specific point in their user flow.
For example, if you&#8217;re experimenting with a new paywall design, do not add users to the experiment on the app launch but once the paywall is shown.</p><p>- Keep a global control group (usually no more than 5%). This approach requires extra effort to maintain and correctly compare with your treatments. However, if you find it workable, you can avoid adding control groups to each experiment in favor of manually extracting the same segment for the comparison from the global control group.</p><p>Remember that having relevant users in your groups is crucial. Do not assign to the test those who are unaffected by the new experience at all. For example</p><p>- You&#8217;re validating the paywall-related monetization hypothesis, which means adding to the test users with active subscriptions is irrelevant.</p><p>- You&#8217;re testing new logic for guiding your users to the Aha! Moment, then exposing the test to those, who already have passed it, adds only extra noise to your data.&#8221;</p></blockquote><p><strong><a href="https://www.split.io/">Split</a>:</strong>&nbsp;</p><blockquote><p>&#8220;There is one area to keep in mind when running parallel experiments. That is the case where two changes directly interfere with one another, creating a different impact on behavior when combined than when in isolation. This happens when experiments are being run on the same page, the same user flow, etc. To avoid interaction effects, review concurrent experiments for interactions: use tags and naming conventions to track changed areas, manually test changes as part of the rollout, and look for other feature flags in nearby code. You should also design colliding tests to highlight interactions and compare each variant&#8217;s performance against one another.&#8221;</p></blockquote><div><hr></div><p>Remote configuration and rapid feature flagging open a new age of A/B testing. Nowadays, most experimentation toolkits are flexible and configurable. 
They empower any product leader to act as though they have enough design and DS resources and to unleash a stream of product iterations in no time. As exciting as that sounds, it means that, as a data scientist, you won&#8217;t have the luxury of calculating the timeline to significance, running a post-rollout impact check, working on a communication plan for every test, or creating a case-study deck for every product change. Brace for a tsunami of a wild mix of kinda-test and not-a-full-rollout launches. And then: learn to swim in it.</p><p>Thanks for reading, everyone. Until next Wednesday!&nbsp;</p><h3><strong>Related publications:</strong></h3><ul><li><p><a href="https://dataanalysis.substack.com/p/how-to-develop-a-highly-trusted-experiment">How To Develop a Highly Trusted Experiment Analysis Workflow</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/playbook-for-launching-monitoring">Playbook For Launching, Monitoring, and Analyzing A/B Tests</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/ab-test-checklist">A/B Test Checklist</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/build-or-buy-how-we-developed-a-platform">Build or Buy: How We Developed A Platform For A/B Tests</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/why-you-shouldnt-stop-ab-tests-early">Why You Shouldn&#8217;t Stop A/B Tests Early</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/how-to-prove-causation-issue-148">How To Prove Causation</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/5-mistakes-to-avoid-when-running-9c3">5 Mistakes To Avoid When Running A/B Tests</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/how-to-get-randomly-distributed-users">How To Get Randomly Distributed Users in SQL</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/growth-loops-and-some-hard-truths">Growth, Loops, And Some Hard Truths - A Recap Of Amplitude Cohort
2022</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/how-we-optimized-the-onboarding-funnel">How we optimized the onboarding funnel by 220%</a></p></li><li><p><a href="https://dataanalysis.substack.com/p/love-and-hate-between-product-and">Love And Hate Between Product Management And Analytics</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[Playbook For Launching, Monitoring, and Analyzing A/B Tests - Issue 134]]></title><description><![CDATA[A framework of product experimentation - the procedure and analysis.]]></description><link>https://dataanalysis.substack.com/p/playbook-for-launching-monitoring</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/playbook-for-launching-monitoring</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 01 Mar 2023 13:01:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!b6Yf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F95e4a893-8f16-49de-a81f-5371ff0dbaea_1600x916.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Welcome to the <a href="https://dataanalysis.substack.com/">Data Analysis Journal</a>, a weekly newsletter about data science and analytics.</em></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://dataanalysis.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://dataanalysis.substack.com/subscribe?"><span>Subscribe now</span></a></p><div><hr></div><p>Today, I will share my step-by-step process for launching, monitoring, analyzing, and reporting A/B tests. 
I&#8217;ll also cover the roles and expectations in test monitoring and support between product managers and data analysts, as this area of responsibility can overlap and is often <a href="https://dataanalysis.substack.com/p/love-and-hate-between-product-and">a source of tension</a>.&nbsp;</p><p>Working at different companies, I ended up developing a few different frameworks, each depending on the data team structure and the analyst reporting role. Depending on whether analysts are embedded in Product, Business, or Engineering, or whether they are part of a team squad or a tiger team, the framework will differ. That said, for today I&#8217;ll keep it at a high level to make it applicable to any organizational and reporting structure.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!g4wx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77aa0bbc-07bc-4793-94a4-e93dae7fe443_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!g4wx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77aa0bbc-07bc-4793-94a4-e93dae7fe443_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!g4wx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77aa0bbc-07bc-4793-94a4-e93dae7fe443_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!g4wx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77aa0bbc-07bc-4793-94a4-e93dae7fe443_200x200.png 1272w,
https://substackcdn.com/image/fetch/$s_!g4wx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77aa0bbc-07bc-4793-94a4-e93dae7fe443_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!g4wx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77aa0bbc-07bc-4793-94a4-e93dae7fe443_200x200.png" width="166" height="166" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77aa0bbc-07bc-4793-94a4-e93dae7fe443_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:166,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!g4wx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77aa0bbc-07bc-4793-94a4-e93dae7fe443_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!g4wx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77aa0bbc-07bc-4793-94a4-e93dae7fe443_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!g4wx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77aa0bbc-07bc-4793-94a4-e93dae7fe443_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!g4wx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77aa0bbc-07bc-4793-94a4-e93dae7fe443_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>First things first. I prefer to differentiate between introducing a new feature and optimizing an existing one, and I classify all product tests into 3 categories:&nbsp;</p><ol><li><p>Optimizing an existing product or feature.</p></li><li><p>Introducing a change to an existing product or feature.</p></li><li><p>Introducing a new product or feature that didn&#8217;t exist before.</p></li></ol><p>While these sound similar (and in many, many companies are treated the same), they actually have different lifecycles and rollouts and should be evaluated differently. </p>
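Whichever category a test falls into, the readout ultimately comes down to comparing Control against Variant on a conversion metric. As a quick illustration (my own sketch, not code from the post), a standard two-proportion z-test in plain Python:

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test comparing Control (a) vs Variant (b) conversion rates."""
    # Pooled conversion rate under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Illustrative numbers: 20% vs 24% conversion on 1,000 users per group
z, p = two_proportion_z_test(conv_a=200, n_a=1000, conv_b=240, n_b=1000)
```

With these made-up numbers the lift clears the usual 5% significance bar (p is roughly 0.03); in practice your experimentation platform's stats engine runs this check for you.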
      <p>
          <a href="https://dataanalysis.substack.com/p/playbook-for-launching-monitoring">
              Read more
          </a>
      </p>
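Before launching any of these tests, a quick sanity check on how long the test needs to run is to estimate the required sample size from the baseline conversion rate and the minimum detectable effect. This is the standard normal-approximation formula (my own sketch, not code from the post):

```python
import math
from statistics import NormalDist

def sample_size_per_group(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate users needed per group to detect a relative lift (MDE)
    on a baseline conversion rate with a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    # Normal-approximation sample size for comparing two proportions
    n = ((z_alpha * math.sqrt(2 * p1 * (1 - p1))
          + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
         / (p2 - p1) ** 2)
    return math.ceil(n)

# 20% baseline conversion, 10% relative MDE
n = sample_size_per_group(baseline=0.20, relative_mde=0.10)
```

For a 20% baseline and a 10% relative MDE this lands in the low thousands of users per group, which matches what the common online sample-size calculators report for the same inputs.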
   ]]></content:encoded></item><item><title><![CDATA[A/B Test Checklist]]></title><description><![CDATA[A short guide to product experimentation steps and process.]]></description><link>https://dataanalysis.substack.com/p/ab-test-checklist</link><guid isPermaLink="false">https://dataanalysis.substack.com/p/ab-test-checklist</guid><dc:creator><![CDATA[Olga Berezovsky]]></dc:creator><pubDate>Wed, 28 Dec 2022 06:35:42 GMT</pubDate><enclosure url="https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/cd7029b3-f274-4215-ac43-d275f496ecf8_200x200.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fcCQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7336868a-9de5-439b-9905-0fa313943267_200x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fcCQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7336868a-9de5-439b-9905-0fa313943267_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!fcCQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7336868a-9de5-439b-9905-0fa313943267_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!fcCQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7336868a-9de5-439b-9905-0fa313943267_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fcCQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7336868a-9de5-439b-9905-0fa313943267_200x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fcCQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7336868a-9de5-439b-9905-0fa313943267_200x200.png" width="146" height="146" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/7336868a-9de5-439b-9905-0fa313943267_200x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:200,&quot;resizeWidth&quot;:146,&quot;bytes&quot;:2197,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fcCQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7336868a-9de5-439b-9905-0fa313943267_200x200.png 424w, https://substackcdn.com/image/fetch/$s_!fcCQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7336868a-9de5-439b-9905-0fa313943267_200x200.png 848w, https://substackcdn.com/image/fetch/$s_!fcCQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7336868a-9de5-439b-9905-0fa313943267_200x200.png 1272w, 
https://substackcdn.com/image/fetch/$s_!fcCQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7336868a-9de5-439b-9905-0fa313943267_200x200.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><h3><strong>&#9997;&#65039; Steps for conducting a product experiment:&nbsp;</strong></h3><ol><li><p><strong>Can you test it?&nbsp;</strong></p></li></ol><p>You can&#8217;t A/B test every little thing. New experiences or new product releases can&#8217;t be run through an A/B test (read <a href="https://dataanalysis.substack.com/p/how-to-measure-product-adoption-issue-39c">How To Measure Product Adoption</a>). Potential biases: the <a href="https://productds.com/wp-content/uploads/Novelty_Effect.html">novelty effect</a> or change aversion.&nbsp;</p><ol start="2"><li><p><strong>Formulate A Hypothesis</strong></p></li></ol><p>Why do you have to run the experiment? What is the ROI? Is it a good time to run the test? Consider seasonality, new version releases, open bugs, etc.&nbsp;</p><p>Set the lift you expect to detect - this is your <a href="https://splitmetrics.com/resources/minimum-detectable-effect-mde/">Minimum Detectable Effect</a> (MDE), the smallest acceptable difference between the Control and the Variant. If the Variant is 0.0001% better than the Control, would you still want to run the test?
Is it worth the cost and time?</p><ol start="3"><li><p><strong>Finalize your set of metrics</strong></p></li></ol><p>For A/B analysis, I use a set of 3 metrics:</p><ul><li><p>Success metrics</p></li><li><p>Ecosystem metrics (company KPIs)</p></li><li><p>Tradeoff metrics</p></li></ul><p>More on this in <a href="https://dataanalysis.substack.com/p/how-to-pick-the-right-metric-issue">How To Pick The Right Metric</a>.</p><ol start="4"><li><p><strong>Define your audience:</strong></p></li></ol><p>Is the change relevant for new or active users, or all? The region, platform, and language? If you work with mature analytics, do you need to limit the test exposure to some type of persona?&nbsp;</p><p>The more user attributes and filters you add, the longer the test will likely run, as it reduces your sample size. That said, it also reduces variance, so the result will be more precise.&nbsp;</p><ol start="5"><li><p><strong>Calculate</strong> <strong>sample size:</strong></p></li></ol><ul><li><p>Set your significance, confidence interval, and power.</p></li><li><p>Your experiment group sizes should be the same.</p></li><li><p>Your sample should be randomly distributed. Recognize traffic, devices, returning users, etc. Work with the engineering team on testing and ensuring that the randomization algorithm works as expected (hashing, clustering, sample stratification?).&nbsp;</p></li><li><p>Make sure no bias is introduced by other tests running.&nbsp;</p></li></ul><ol start="6"><li><p><strong>Run the test</strong> until you reach significance, and then a little longer. Monitor the test timeline and events.</p></li><li><p><strong>Evaluate</strong> <strong>results</strong>:</p></li></ol><ul><li><p>Run sanity checks. Control metrics and conversions should match the Baseline. If they don&#8217;t, question the test setup.&nbsp;</p></li><li><p>Check sample variance and distribution. High variance often leads to low trust.</p></li><li><p>Run spot checks.
Pick a few users from the Control and Variant samples and check them to ensure they are random, don&#8217;t overlap with other tests, and meet the test requirements.&nbsp;</p></li><li><p>If the result is not what you expected, think of potential biases - the novelty, learning, or network effect.&nbsp;</p></li></ul><ol start="8"><li><p><strong>Draw conclusions</strong> and provide a recommendation on the next steps to product owners.&nbsp;</p></li></ol><h3><strong>&nbsp;&#128293; Things to remember:&nbsp;</strong></h3><ul><li><p>Run <a href="https://vwo.com/blog/aa-test-before-ab-testing/">an A/A test</a> first. It helps you check the software, outside factors, and natural variance. You need to know the sample variance to estimate the significance level and statistical power.&nbsp;</p></li><li><p>Don&#8217;t pick metrics that are either too sensitive (views) or too robust (Day 7 or Day 30 retention). They are not helpful and tend to mislead you. The best test metric shows a change in the result without fluctuating much when other events occur.</p></li><li><p>Don&#8217;t run the experiment for too long, as you might experience data pollution - when multiple devices, cookies, and other outside factors affect your result.&nbsp;</p></li><li><p>Don&#8217;t run the experiment for too short a time either, as you might get a false positive (<a href="https://dataanalysis.substack.com/p/these-tricky-ab-tests-or-why-significance">regression to the mean</a>).
In other words, a variable that is extreme at first tends to move closer to the average over time.</p></li><li><p>When introducing a new change, run the test on a smaller sample for a longer period of time to eliminate the novelty or learning effect bias.</p></li></ul><h3><strong>&#128163; Statistical terminology</strong></h3><p>To approach A/B testing, you can think of <strong><a href="https://www.invespcro.com/blog/calculating-sample-size-for-an-ab-test/">Null-Hypothesis</a></strong><a href="https://www.invespcro.com/blog/calculating-sample-size-for-an-ab-test/"> </a><strong><a href="https://www.invespcro.com/blog/calculating-sample-size-for-an-ab-test/">testing</a></strong> and apply the following terms:</p><ul><li><p><a href="https://www.cs.purdue.edu/homes/ribeirob/courses/Fall2016/lectures/hyp_tests.pdf">P-value</a> - assuming the Null-H is true, the probability of seeing a result at least as extreme as the one observed. If the data falls in the "not expected" region, we reject the Null-H.&nbsp;</p></li><li><p><a href="https://hbr.org/2016/02/a-refresher-on-statistical-significance">Statistical Significance</a> (or Significance level, alpha) is the probability of seeing the effect when none exists (false positive).</p></li><li><p><a href="https://productcoalition.com/start-here-statistics-for-a-b-testing-5f5c7e02ce1e">Statistical Power</a> (or 1-beta) is the probability of seeing the effect when it does exist.</p></li><li><p><a href="https://cxl.com/blog/confidence-intervals/">Confidence Interval</a> is the range within which the true effect is expected to fall at a given confidence level: the narrower the CI, the more precise the result.</p></li><li><p><a href="https://evolytics.com/resources/calculators/abtesting-statistical-significance/">z-score</a> is the number of Standard Deviations from the mean.</p></li></ul><h3><strong>&#129300; If you are lost in conversions and numbers, check this guide:&nbsp;</strong></h3><ol><li><p>If your baseline conversion is 20%, you may set the MDE to 10%, and the test may detect 18% - 22%
conversion results.</p></li><li><p>The higher your baseline conversion, the smaller the sample size you&#8217;ll need.</p></li><li><p>The smaller the MDE, the larger the sample you&#8217;ll need.</p></li><li><p>Low p-values are good; they indicate the result is unlikely to have occurred by chance.</p></li><li><p>A common setup is a 95% confidence level and 80% statistical power.</p></li><li><p>It&#8217;s often recommended to run the experiment for 2 business cycles (2-4 weeks).&nbsp;</p></li></ol><p>&#128226; Use <a href="https://www.optimizely.com/sample-size-calculator/">this calculator</a> or <a href="https://www.evanmiller.org/ab-testing/sample-size.html">this one</a> to determine the needed sample size for your experiment.&nbsp;</p><p>&#128226; Use <a href="http://www.abtestcalculator.com/">this calculator</a> to evaluate your test significance and result.&nbsp;</p><h3><strong>&#128269; Other types of product testing</strong></h3><ol><li><p><a href="https://vwo.com/blog/difference-ab-testing-multivariate-testing/">Multivariate testing</a> (MVT) - multiple variants and their combinations within a single test.</p></li><li><p><a href="https://help.vwo.com/hc/en-us/articles/360020072833-What-is-Split-URL-Testing-">Split URL testing</a> - multiple versions of your webpage posted on different URLs.</p></li><li><p><a href="https://blog.optimizely.com/2013/06/12/4-ways-to-maximize-conversions-with-multi-page-testing/">Multipage testing</a> - testing changes across different pages. There is both funnel Multi-Page testing and Conventional Multi-Page testing. Read more <a href="https://vwo.com/ab-testing/">here</a>.</p></li></ol><p>Check out <a href="https://blog.hubspot.com/marketing/how-to-do-a-b-testing">this guide</a> if you want an A/B experiment checklist.</p>]]></content:encoded></item></channel></rss>