Causal Inference Methods for Bridging Experiments and Strategic Impact - Issue 267
How to connect A/B test results to real-world business decisions - lessons from Roblox
Welcome to the Data Analysis Journal, a weekly newsletter about data science and analytics.
Today I want to introduce you to someone great - a fellow data scientist, Wenjing Zheng, Senior Data Science Manager at Roblox. I met Wenjing in May at Data Council, and in this issue I'm sharing her insights on connecting A/B test results to real-world business decisions at Roblox.
This is what data scientists spend most of their time doing - teasing apart the true impact of A/B tests from holidays and seasonality, from external factors driving change, and from other campaigns running in parallel. It's hard: think of explaining why a true +10% lift in transactions shows up as only a 0.01% ARR increase. I've shared some of my methods for estimating such impact before, and today I want to show how Roblox does it.
Below is a recap of Wenjing's talk on causal inference methods that help bridge the gap between clean experimental results and messy strategic decisions: how to attribute business growth to product launches and how to generalize experiment outcomes to broader populations.
Roblox is an online platform where users can both play and create their own games, called "experiences". Wenjing leads a data science team responsible for the experimentation platform, which hosts hundreds of tests and manages time-series tooling for forecasting, business monitoring, anomaly detection, and root cause analysis. The team uses multiple causal inference methods to enable and support data science partners.
Below, I share slides and takeaways on how Wenjing's team approaches estimating the impact of A/B tests.
How to segment out local vs. global impact?
Individual teams at Roblox run experiments on surfaces like the homepage, notifications, marketplace, etc. Each experiment has a local lift, like a +1% increase in time spent from a new notification. But local lift ≠ global impact: some surfaces get lots of traffic (e.g., homepage), while others have niche audiences.
When leadership tries to prioritize based on impact, for example, “Team A improved time spent by 1%, Team B by 0.1%”, it ignores:
Reach: How many users are exposed.
Baseline: Was it already optimized or easy to move?
Typically, in this case, teams use qualitative intuition (“our surface has less reach,” “this was a hard problem,” etc.), leading to inconsistent prioritization. We need better ways to quantify it.
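One simple way to quantify it is to weight each team's local lift by how much of the global metric actually flows through its surface. The sketch below is my own minimal illustration of that idea, not Roblox's actual methodology - the function name, the surface_share parameter, and all numbers are hypothetical.

```python
# Minimal sketch (hypothetical, not Roblox's method): convert a surface-level
# "local" lift into an estimated global impact by weighting it with the
# surface's share of the global metric.

def global_impact(local_lift: float, surface_share: float) -> float:
    """Estimate the global metric lift implied by a local lift.

    local_lift    -- relative lift measured on the surface (0.01 means +1%)
    surface_share -- fraction of the global metric carried by that surface
                     (its reach / baseline contribution), e.g., 0.05 for 5%
    """
    return local_lift * surface_share


# Team A: +1% time-spent lift on a niche surface carrying 5% of total time spent
team_a = global_impact(local_lift=0.01, surface_share=0.05)   # -> +0.05% global

# Team B: +0.1% lift on the homepage, which carries 60% of total time spent
team_b = global_impact(local_lift=0.001, surface_share=0.60)  # -> +0.06% global

print(f"Team A global lift: {team_a:.4%}")
print(f"Team B global lift: {team_b:.4%}")
# Despite the smaller local lift, Team B moves the global metric more,
# which is exactly the kind of comparison raw local lifts obscure.
```

Even this back-of-the-envelope weighting accounts for reach; accounting for the second factor (how easy the baseline was to move) requires the causal inference methods Wenjing's team uses.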