How To Develop a Highly Trusted Experiment Analysis Workflow
How to get trustworthy and actionable insights from A/B tests using SQL - a guest post by Chetan Sharma, CEO of Eppo
Good morning and welcome to another edition of the Data Analysis Journal newsletter, where I write about data analysis, data science, and product analytics.
With all the planning and thinking ahead of the new year, I am on the lookout for successful frameworks and practices that I could bring to my team at MyFitnessPal to supercharge our experimentation analytics. With so many tests running, how can a data science and analytics team strike the balance between high trust and timeline flexibility? How do they pick the right sources for data evaluation? How do they produce consistent reporting on many different A/B tests running in parallel? And how do they centralize experimentation reporting while still allowing self-serve investigation?
With so many questions in mind, I invited Chetan Sharma, founder and CEO of Eppo, a next-gen AB experimentation platform for data science and product teams. Chetan was also an early Data Scientist at Airbnb, where he spent five years working on ML fraud detection, logging integrity, mobile product analytics, and knowledge management. Today, he is one of the best experts on experimentation, and I am so excited for him to walk us through the process, caveats, challenges, best practices, and even SQL to provide guidance on how to develop a trustworthy and actionable workflow for A/B test analysis.
How To Develop a Highly Trusted Experiment Analysis Workflow by Chetan Sharma
Where trust meets the process
The ROI of analytics work can be squishy. Engineering, marketing, sales, and design all have clear connections to revenue by building and selling a product, but analytics doesn’t do any of those things. Instead, analytics earns its ROI by turning bad decisions into good ones. Introducing data into a decision process steers a team onto a path that yields more revenue, less loss, or better progress toward other strategic outcomes.
The good news is that data teams have a tool that plugs metrics directly into product decision-making: AB experiments. Experiments are the purest form of metrics, determining consequential decisions of product direction, marketing campaign continuation, and ultimately, organizational investment.
The bad news is that driving highly consequential decisions requires trust. It requires far more trust than a simple product analytics ask like “how many people clicked that button?” Weeks of effort from a group of engineers, designers, and product managers will come down to a single number. And to make things worse, that number is probably going to be something esoteric like a p-value.
When running an experimentation program, ROI doesn’t come from real-time pipelines or launch speed. It comes from reliably rigorous decisions. It comes from trust.
How to achieve trustworthy experiment results
This post is about the pitfalls of getting an organization to trust experiment results, and how to produce an experimentation system that product teams are excited to use (with SQL!).
The path to experiment program success is:
Centralize the experiment analysis process
Proactively run diagnostics
Produce consistent, accessible reporting
Enable people to run sanity checks and basic investigations
Centralize the experiment analysis process
As anyone who has worked with data knows, the simplest query requires a bunch of bespoke decisions. A question like “how many purchases did we get in China last week?” involves choosing i.) what counts as a “purchase,” ii.) whether the user has to be primarily China-based or just in China at the moment of purchase, and iii.) whether the week starts on Sunday or Monday.
Experimentation is the same problem but on steroids. While you are probably OK with an approximately correct, +/- 10% answer to purchases in China, in the experimentation world, the entire product launch’s effect is probably 5% or less. Precision matters.
Similar to business intelligence reporting, this means that experiment analysis should align on a core set of analysis choices, and enforce their usage. Here’s a short list of analysis choices that allow too many degrees of freedom when done ad hoc, and should instead be centralized and vetted: metric definitions, experiment time windows, identity resolution, mixed groups, and statistical test types.
Metric Definitions
Here’s an example metric query to calculate purchases; one you might see at many growth companies:
-- Purchases query
SELECT
a.ts
, a.product_id
, a.user_id
, b.cost
FROM core_db_snapshots.purchases a
JOIN products b ON (a.product_id = b.id)
WHERE a.status IN (1,2)
In a highly trusted analysis workflow, you shouldn’t have to trust that the analyst remembered the WHERE a.status IN (1,2) clause, or remembered to include mobile purchases and offline purchases. A centralized analysis workflow makes everyone use the same metric definition logic.
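One way to centralize this is to put the vetted definition in a single shared model or view that every experiment analysis selects from. Here is a minimal sketch reusing the purchases query above; the name metric_purchases is hypothetical, and in a dbt workflow this would simply be a model:

-- Centralized metric definition: the only place the purchase filters live
CREATE VIEW metric_purchases AS
SELECT
a.ts
, a.product_id
, a.user_id
, b.cost
FROM core_db_snapshots.purchases a
JOIN products b ON (a.product_id = b.id)
WHERE a.status IN (1,2)

Experiment analyses then read from metric_purchases instead of re-deriving the filters, so the status clause can’t be silently forgotten.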
Experiment Windows
Most experiment analyses end up with a query that looks like this:
SELECT
a.*
, e.*
FROM experiment_assigned_users a
JOIN purchase_events e ON (a.user_id = e.event_user_id)
WHERE a.ts_assigned <= e.ts
AND a.ts_assigned BETWEEN {{ experiment_start_time }} AND {{ experiment_end_time }}
AND e.ts BETWEEN {{ experiment_start_time }} AND {{ experiment_end_time }}
Notice that there’s a lot of time logic here. Time logic seems simple, but it gets complicated when you mix UTC timestamps with local-timezone datestamps, or have to choose between > and >= when comparing two datestamps. And if you end up building advanced features, like CUPED variance reduction, it gets even weirder, as you need to strictly separate pre-experiment data from post-experiment data.
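For example, a CUPED-style analysis needs a per-user pre-experiment aggregate that stops strictly before each user’s assignment time. Here is a minimal sketch under a few assumptions: a 28-day pre-period, an experiment_group column on experiment_assigned_users, and interval syntax that varies by warehouse:

-- Pre- vs. post-assignment windows, kept strictly separate for CUPED
SELECT
a.user_id
, a.experiment_group
, COUNT(CASE WHEN e.ts < a.ts_assigned THEN 1 END) AS pre_period_purchases -- CUPED covariate
, COUNT(CASE WHEN e.ts >= a.ts_assigned THEN 1 END) AS post_period_purchases -- experiment metric
FROM experiment_assigned_users a
LEFT JOIN purchase_events e
ON (a.user_id = e.event_user_id
AND e.ts BETWEEN a.ts_assigned - INTERVAL '28' DAY AND {{ experiment_end_time }})
WHERE a.ts_assigned BETWEEN {{ experiment_start_time }} AND {{ experiment_end_time }}
GROUP BY a.user_id, a.experiment_group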
Identity Resolution and Mixed Groups
Growth teams love running AB experiments on new users, randomizing cookies or devices. This gets tricky for two reasons:
Your metrics are likely sourced from application database tables that don’t have cookies, only user IDs.
Some users land on both sides of the experiment, for example by seeing the baseline group on their phone and the new UX on their desktop.
You’ll probably need to run a couple of queries like this:
WITH assignments_user_imputed AS (
SELECT
a.*
, b.user_id
FROM assignments a
LEFT JOIN anonymous_user_mapping b ON (a.anonymous_id = b.anonymous_id)
-- Ignore for now that you'll probably want to use a type 2 dimensional table
),
assignments_mixed_flagged AS (
SELECT
user_id
, MIN(ts_assigned) AS ts_assigned
, COUNT(DISTINCT experiment_group) AS m_experiment_groups
, CASE WHEN COUNT(DISTINCT experiment_group) > 1 THEN 'mixed' ELSE 'ok' END AS is_mixed
FROM assignments_user_imputed
GROUP BY user_id
)
SELECT * FROM assignments_mixed_flagged WHERE is_mixed = 'ok'
On top of that, it’s good to have diagnostics to make sure there isn’t too much contamination across experiment groups, and to check that the groups stay evenly split so the contamination doesn’t skew in one direction. This is a deep topic.
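Here is a minimal sketch of that diagnostic, assuming the assignments_user_imputed logic above is available (for example, materialized as a table): it measures how much of each group is mixed and whether contamination leans toward one side.

-- Contamination diagnostic: how much of each group is "mixed", and is it lopsided?
WITH user_groups AS (
SELECT
user_id
, COUNT(DISTINCT experiment_group) AS m_experiment_groups
FROM assignments_user_imputed
GROUP BY user_id
)
SELECT
a.experiment_group
, COUNT(DISTINCT a.user_id) AS n_users
, COUNT(DISTINCT CASE WHEN u.m_experiment_groups > 1 THEN a.user_id END) AS n_mixed_users
, COUNT(DISTINCT CASE WHEN u.m_experiment_groups > 1 THEN a.user_id END) * 1.0
/ COUNT(DISTINCT a.user_id) AS mixed_share
FROM assignments_user_imputed a
JOIN user_groups u ON (a.user_id = u.user_id)
GROUP BY a.experiment_group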
Statistical Tests
Experimentation comes in different types and flavors, and data science and analytics teams need a way to set up the right statistical test for each one.
Unfortunately, statistical tests don’t have an out-of-the-box implementation in a SQL-centric, dbt + Tableau workflow. You’ll have to get creative here, jamming a python script into your DAG or even coding confidence interval logic in SQL.
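As an illustration of the latter, here is a minimal sketch of difference-in-means logic with a 95% confidence interval for a continuous metric, using a normal approximation. The experiment_user_metrics table (one row per user with a metric_value column) and the 'control' / 'treatment' group names are assumptions; small samples and ratio metrics need more care.

-- Difference in means with a 95% confidence interval, normal approximation
WITH group_stats AS (
SELECT
experiment_group
, COUNT(*) AS n
, AVG(metric_value) AS mean_value
, STDDEV(metric_value) AS sd_value
FROM experiment_user_metrics -- hypothetical: one row per user, per-user metric value
GROUP BY experiment_group
)
SELECT
t.mean_value - c.mean_value AS estimated_lift
, (t.mean_value - c.mean_value)
- 1.96 * SQRT(t.sd_value * t.sd_value / t.n + c.sd_value * c.sd_value / c.n) AS ci_lower
, (t.mean_value - c.mean_value)
+ 1.96 * SQRT(t.sd_value * t.sd_value / t.n + c.sd_value * c.sd_value / c.n) AS ci_upper
FROM group_stats t
JOIN group_stats c
ON (t.experiment_group = 'treatment' AND c.experiment_group = 'control')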
But even if you use Python in a Jupyter notebook, you’ll need to make choices around:
What test to use (t-test vs. Bayesian vs. sequential)
Matching the testing method to whether the metric is binary, continuous, or a ratio (e.g. ratio metrics require the delta method)
One-tailed vs. two-tailed tests
Most growth companies have a handful of people who are equipped to make these choices. Centralizing the workflow lets their knowledge scale across the org.
Proactively run diagnostic tests
Even in the most sophisticated experimentation programs, like Airbnb’s and Netflix’s, experiment setup issues are incredibly common. The two most frequent are traffic imbalances and outliers.
Traffic imbalances can be detected with a sample ratio mismatch test, which is a chi-squared test under the hood. Outliers can be detected with statistical tests or heuristics, but you can also effectively mitigate them without detecting them at all, using winsorization and CUPED (see “Bending time in experimentation” for more).
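Here is a minimal sketch of a sample ratio mismatch check in SQL for a two-arm test designed as a 50/50 split, using the assignments table from above. The resulting statistic is compared against the chi-squared critical value for one degree of freedom, roughly 3.84 at p = 0.05.

-- Sample ratio mismatch: compare observed assignment counts to the expected 50/50 split
WITH counts AS (
SELECT
experiment_group
, COUNT(DISTINCT user_id) AS n_observed
FROM assignments
GROUP BY experiment_group
),
totals AS (
SELECT SUM(n_observed) / 2.0 AS n_expected FROM counts -- expected per group for a 50/50 split
)
SELECT
SUM((c.n_observed - t.n_expected) * (c.n_observed - t.n_expected) / t.n_expected) AS chi_squared_stat
-- With 1 degree of freedom, a statistic above ~3.84 flags a likely misconfigured split
FROM counts c
CROSS JOIN totals t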
With diagnostics, it’s not just important to catch issues. It’s important to catch them:
Early on: There is an enormous difference in trust capital burned between a team feeling like they wasted weeks waiting on a test that turned out to be broken, and a team that diagnosed, repaired, and relaunched the test within the first day.
Comprehensively: people reading experiment reports shouldn’t wonder whether anyone ran a diagnostic test. Even worse, memorialized experiment successes that are later revealed to be buggy setups are incredibly bad for trust in experimentation.
Produce consistent, accessible reporting
Remember, experiments decide whether weeks of engineering and design effort get mainlined or dismantled, and whether people get promoted and teams get resourced. Here’s a smattering of home-grown AB testing reports from a few seconds of googling.
Each of these reports shows numbers and charts. The analytics and statistics are probably all correct. But I am guessing the analysts behind these reports have to do tons of post-experiment investigations so that their product teams can convince themselves that the numbers are correct.
The more unintuitive and consequential the experiment result is, the more publicly visible it’ll be. If a CPO receives a link to your experiment result, it’s worth investing in their confidence and excitement in what they see.
Self-serve investigations
When we built Airbnb’s experimentation system, one of the early bets that had a huge payoff was enabling users to easily slice and dice experiment results. This segmentation feature, along with linear trend lines, was an enormous boost to the experimentation program. Along with saving the data team a ton of manual analysis time, it also allowed experimenters to easily gain trust in the results.
There are a few reasons for this:
A few cuts by device, region, and subscription tier can quickly uncover whether the entire experiment effect is due to a product bug.
Trends over time give a sense of metric stability and underlying data quality.
Segmentation cuts let readers personalize the result to their domain area. A marketer can segment by channel, a product leader by user persona, and an ops leader by plan type.
Every consumer of analytics reports has learned to twist and turn reports from different angles to build their own confidence. The investigations aren’t just there to answer an analytical question; they also convince decision-makers that the results aren’t due to an artifact.
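In practice, a single segment cut is usually just one extra GROUP BY column. Here is a minimal sketch against a hypothetical per-user results table (experiment_user_metrics, with a device column) that makes it easy to see whether one device drives the whole effect:

-- Segment cut: metric by device and experiment group
SELECT
device
, experiment_group
, COUNT(DISTINCT user_id) AS n_users
, AVG(metric_value) AS mean_value
FROM experiment_user_metrics
GROUP BY device, experiment_group
ORDER BY device, experiment_group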
Invest in trust, get impact
After a decade that saw Airbnb, Netflix, and Facebook each win their market with AB experimentation, the practice has gone mainstream. AB test analysis is part of analytics boot camps, PM interviews, and even business school curriculums.
But while the components of AB experiments are randomization, metric calculation, and statistical tests, great components weren’t what made Airbnb’s experimentation program great. Great experimentation programs are willing to let customer desires (metrics) guide a huge range of product decisions. And in that practice, all of the data team’s ETL and warehousing investments structurally drive revenue-improving decisions.
Thanks, Chetan!
Find and connect with Chetan on LinkedIn and Twitter.
Thanks for reading, everyone. Until next Wednesday!