Data Analysis Journal


Causal Inference Methods for Bridging Experiments and Strategic Impact - Issue 267

How to connect A/B test results to real-world business decisions - lessons from Roblox

Olga Berezovsky
Jul 16, 2025

Welcome to the Data Analysis Journal, a weekly newsletter about data science and analytics.


Today I want to introduce you to someone great - a fellow data scientist, Wenjing Zheng, Senior Data Science Manager at Roblox. I met Wenjing in May at Data Council, and in this issue I'm sharing her insights on connecting A/B test results to real-world business decisions at Roblox.

This is what data scientists spend most of their time doing - teasing apart the true impact of A/B tests from holidays and seasonality, from external factors driving change, and from other campaigns running in parallel. It’s hard, like explaining why a true +10% lift in transactions shows up as only a 0.01% ARR increase. I’ve shared some of my methods of estimating such impact before, and today, I want to show how Roblox is doing it.
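To make that dilution concrete, here is a minimal sketch with illustrative numbers of my own (not Roblox's): a large local lift shrinks once you account for how little of total ARR the exposed segment actually contributes.

```python
# Illustrative numbers only: a +10% lift on a segment that contributes
# 0.1% of total ARR moves total ARR by roughly 0.01%.
segment_share_of_arr = 0.001   # exposed segment drives 0.1% of ARR (assumed)
local_lift = 0.10              # true +10% lift in transactions within that segment

global_arr_lift = segment_share_of_arr * local_lift
print(f"Expected ARR increase: {global_arr_lift:.4%}")  # -> 0.0100%
```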

Below is a recap of Wenjing’s talk on causal inference methods that help bridge the gap between clean experimental results and messy strategic decisions: how to attribute business growth to product launches and how to generalize experiment outcomes to broader populations.

Roblox is an online platform where users can both play and create their own games, called "experiences". Wenjing leads a data science team responsible for the experimentation platform, which hosts hundreds of tests and manages time-series tooling for forecasting, business monitoring, anomaly detection, and root cause analysis. The team uses multiple causal inference methods to enable and support data science partners.

Below, I share slides and takeaways on how Wenjing’s team estimates the impact of A/B tests.

How to segment out local vs. global impact?

Individual teams at Roblox run experiments on surfaces like the homepage, notifications, marketplace, etc. Each experiment has a local lift, like +1% increase in time spent from a new notification. But local lift ≠ global impact: some surfaces get lots of traffic (e.g., homepage), while others have niche audiences.

When leadership tries to prioritize based on impact, for example, “Team A improved time spent by 1%, Team B by 0.1%”, that raw comparison ignores:

  • Reach: How many users are exposed.

  • Baseline: Was it already optimized or easy to move?

Typically, in this case, teams use qualitative intuition (“our surface has less reach,” “this was a hard problem,” etc.), leading to inconsistent prioritization. We need better ways to quantify it.
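One way to make reach explicit instead of arguing it qualitatively is to translate each team’s local lift into a global delta by weighting it by the surface’s share of the overall metric. This is a rough sketch of my own with hypothetical surface names and numbers, not Wenjing’s exact method:

```python
# Rough sketch: convert local lifts into comparable global deltas by weighting
# each surface's lift by its share of the overall metric (time spent here).
# Surface names and numbers are hypothetical, not Roblox data.

surfaces = {
    # surface: (local_lift, share_of_total_time_spent)
    "homepage":      (0.001, 0.50),  # +0.1% lift on a surface with 50% of time spent
    "notifications": (0.010, 0.05),  # +1.0% lift on a surface with 5% of time spent
}

for name, (local_lift, share) in surfaces.items():
    global_delta = local_lift * share  # contribution to the global metric
    print(f"{name}: local {local_lift:+.2%} -> global {global_delta:+.3%}")

# homepage:      local +0.10% -> global +0.050%
# notifications: local +1.00% -> global +0.050%
```

In this toy example, the two launches look very different locally but move the global metric by the same amount, which is exactly the kind of comparison leadership needs. It still leaves the baseline question open - whether the surface was already heavily optimized - which is where the causal inference methods in the rest of the talk come in.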
