Advanced Experimentation
The methods senior-IC JDs name — multi-armed bandits, sequential testing — plus the variance reduction and interference handling that separate strong experimentation programs from the average.
Multi-armed bandits
The Lead-DS JD lists "multi-armed bandits" alongside A/B tests. The bar is: you know when to use one, what algorithm you'd choose, and how to defend the tradeoff against a fixed-traffic A/B test.
What MABs are
A bandit dynamically allocates more traffic to better-performing arms during the test, rather than splitting 50/50 throughout. The result: less "regret" (revenue or conversion lost to the inferior arm) while still learning which arm is best.
When to use one
- Short-lived decisions with fast reward feedback. Headlines, thumbnails, news ranking, ad creative. You care about cumulative reward over the test window, not just inference at the end.
- Many arms. With 10 creative variants, a fixed-split test pays full sample in every arm, including the obvious losers. A bandit prunes losers fast.
- Continuous deployment. No clean "ship/kill" moment — you want the system to keep optimizing.
When NOT to use one
- You need a clean unbiased point estimate for the lift. Bandits trade inferential cleanliness for cumulative reward. The data is no longer i.i.d. across the test.
- Slow reward feedback. If conversion takes 30 days to observe, the bandit can't adapt during the test window; it degenerates into an A/B test.
- The change is structurally large or risky. Bandits aren't a safety mechanism. A canary or staged rollout is.
- Stakeholder distrust. Bandits feel "automatic" to non-DS partners. If you can't explain the algorithm to a PM in 30 seconds, the lift is harder to defend.
Three algorithms to know
| Algorithm | Intuition | Use when |
|---|---|---|
| ε-greedy | Pick the best arm with probability 1−ε, random with probability ε | Stationary world, simple to explain |
| UCB1 | Pick the arm with the highest upper confidence bound on the reward | You want theoretical regret guarantees; reward is bounded |
| Thompson sampling | Sample each arm's reward from a posterior, pick the max | Reward has a clean conjugate prior (Beta-Binomial for CTR); Bayesian framing |
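A minimal Thompson sampler for binary rewards (click / no-click), short enough to explain in a review: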
import numpy as np

class ThompsonBandit:
    def __init__(self, n_arms: int):
        # Beta(1,1) prior per arm — uniform on [0,1]
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)

    def pull(self) -> int:
        # Sample from each arm's posterior; pick the best sample
        samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm: int, reward: int):
        # reward in {0, 1}
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
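A usage sketch with made-up true CTRs, to check that the sampler concentrates traffic on the best arm:

# Simulation sketch: the true CTRs are made up for illustration
rng = np.random.default_rng(0)
true_ctr = [0.03, 0.05, 0.04]
bandit = ThompsonBandit(n_arms=3)
for _ in range(10_000):
    arm = bandit.pull()
    bandit.update(arm, reward=int(rng.random() < true_ctr[arm]))
# Posterior means should concentrate on arm 1 (the 5% arm)
print(bandit.alpha / (bandit.alpha + bandit.beta))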
Sequential testing
Classical A/B tests require committing to a sample size up front and not peeking. In practice, stakeholders peek. Sequential testing makes peeking valid.
mSPRT (mixture Sequential Probability Ratio Test)
Used by Optimizely and others. Continuously valid p-values — you can monitor every day, stop when the test crosses a threshold, and the Type I error stays bounded at α.
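A sketch of the always-valid p-value for a normal mean, following the mixture form in Johari et al.; the function name is ours, the variance sigma2 is treated as known, and tau2 is an analyst-chosen mixing variance:

import numpy as np

def msprt_p_value(diffs: np.ndarray, sigma2: float, tau2: float) -> float:
    """Always-valid p-value for H0: mean(diffs) == 0, mixing over N(0, tau2).
    diffs: per-unit treatment-minus-control differences observed so far.
    In production, report the running minimum of this value across looks."""
    n = len(diffs)
    ybar = diffs.mean()
    # Mixture likelihood ratio for a normal mean with known variance sigma2
    lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        n**2 * tau2 * ybar**2 / (2 * sigma2 * (sigma2 + n * tau2))
    )
    return min(1.0, 1.0 / lam)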
Group sequential boundaries
Pre-specify N interim looks and adjust the per-look α so cumulative Type I error stays at the target. Lan-DeMets spending functions are the standard implementation. Less flexible than mSPRT but easier to audit.
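For intuition, the O'Brien-Fleming-type spending function is a one-liner; real libraries then back out per-look boundaries from the increments numerically. A sketch (function name is ours; scipy assumed available):

from scipy.stats import norm

def obf_alpha_spent(info_fracs, alpha=0.05):
    """Cumulative Type I error spent at each look under the O'Brien-Fleming-like
    Lan-DeMets spending function. info_fracs: fractions of planned sample seen."""
    z = norm.ppf(1 - alpha / 2)
    return [2 * (1 - norm.cdf(z / t**0.5)) for t in info_fracs]

# Looks at 25/50/75/100% of sample: almost no alpha spent early,
# the full 0.05 available by the final look.
# obf_alpha_spent([0.25, 0.5, 0.75, 1.0])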
Bayesian sequential
Report posterior probability of treatment beating control. No sample-size commitment; stop when posterior crosses a threshold. Requires defensible priors, which is the catch.
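The reported quantity is simple to compute; a minimal sketch assuming Bernoulli metrics and uniform Beta(1,1) priors (the prior choice is exactly the contested part):

import numpy as np

def prob_treatment_beats_control(s_t, n_t, s_c, n_c, draws=100_000, seed=0):
    """Monte Carlo estimate of P(p_treatment > p_control).
    s_* = successes, n_* = trials; independent Beta(1,1) priors per arm."""
    rng = np.random.default_rng(seed)
    p_t = rng.beta(1 + s_t, 1 + n_t - s_t, draws)
    p_c = rng.beta(1 + s_c, 1 + n_c - s_c, draws)
    return float((p_t > p_c).mean())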
If an interviewer asks "how do you handle peeking?", "use a sequential test like mSPRT" is the strong answer. The weak answer is "don't peek." The weak answer is also true, but the strong answer signals you've worked at a scale where stakeholders peek anyway.
CUPED variance reduction
CUPED (Controlled-experiment Using Pre-Experiment Data) reduces the variance of your treatment effect estimate by adjusting for a pre-period covariate. Microsoft published it; it's standard at every mature experimentation platform.
The idea
If you knew each user's pre-period metric value, you could subtract theta × (pre − mean(pre)) from their post-period observation, where theta is the OLS slope of post on pre. The resulting adjusted outcomes have lower variance — sometimes 30–50% lower — at zero cost in unbiasedness.
import numpy as np

def cuped_estimate(y_treatment, y_control, x_treatment, x_control):
    """y = post-period metric, x = pre-period metric (per user)."""
    y = np.concatenate([y_treatment, y_control])
    x = np.concatenate([x_treatment, x_control])
    # theta = cov(y, x) / var(x)
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    y_t_adj = y_treatment - theta * (x_treatment - np.mean(x))
    y_c_adj = y_control - theta * (x_control - np.mean(x))
    return np.mean(y_t_adj) - np.mean(y_c_adj), y_t_adj, y_c_adj
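A usage sketch on simulated data with a made-up 0.2 lift; the variance drop comes entirely from the pre/post correlation:

rng = np.random.default_rng(0)
x_t, x_c = rng.normal(10, 3, 5000), rng.normal(10, 3, 5000)   # pre-period
y_t = x_t + rng.normal(0.2, 2, 5000)   # post = pre + noise + 0.2 lift
y_c = x_c + rng.normal(0.0, 2, 5000)
est, y_t_adj, y_c_adj = cuped_estimate(y_t, y_c, x_t, x_c)
# np.var(y_t_adj) is far below np.var(y_t): same estimate, tighter CI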
When it helps most
- Pre-period metric is highly correlated with post-period (true for any "engagement" metric — heavy users stay heavy).
- Users have meaningful pre-period history (new users have noisy or missing pre-period values).
When to be careful
- New-user experiments: no pre-period to adjust on. CUPED degenerates to the unadjusted estimate.
- Composition shift: if treatment changes who shows up, the covariate is no longer "pre-treatment" in the right sense.
Switchback tests
For marketplaces, ride-hailing, food delivery — anywhere the unit of consumption can't be cleanly randomized because supply is shared. Treatment is applied to a whole region for a time window, then switched. Compare metrics during treatment windows vs control windows.
- Pros: handles interference (you don't have control riders and treatment riders competing for the same drivers).
- Cons: huge variance (each "unit" is a region-window, and there are far fewer of those than individual users); carryover needs careful handling (does last hour's pricing affect this hour's demand?); time-of-day confounding is brutal (see the sketch below).
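A deliberately naive window-level sketch (names ours): the unit of analysis is the region-window, and the failure modes listed above (carryover, time-of-day, autocorrelation) are exactly what it ignores:

import numpy as np

def switchback_estimate(window_means: np.ndarray, is_treated: np.ndarray):
    """Difference in means over region-windows, treating each window as one
    i.i.d. unit. Real analyses adjust for time-of-day and autocorrelation
    (e.g., a block bootstrap over days); this sketch deliberately does not."""
    t, c = window_means[is_treated], window_means[~is_treated]
    diff = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    return diff, se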
Network & cluster designs
When SUTVA breaks (treatment of one user affects another), cluster-randomize at the level that contains the spillover. Examples:
- Social: randomize at the friend-cluster level via community detection.
- Geographic: randomize at the city or DMA level.
- Marketplace: randomize at the market or time-of-day level (switchback).
The cost: many fewer "units" → variance balloons. The "cluster effective sample size" can be a tiny fraction of the raw user count if within-cluster correlation is high.
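The standard back-of-envelope here is the Kish design effect; a sketch:

def effective_sample_size(n_users: int, avg_cluster_size: float, icc: float) -> float:
    """Kish design effect: n_eff = n / (1 + (m - 1) * rho), with m the average
    cluster size and rho the intraclass correlation."""
    return n_users / (1 + (avg_cluster_size - 1) * icc)

# 1M users in clusters of 1,000 with ICC 0.05 -> roughly 20k effective users
# effective_sample_size(1_000_000, 1_000, 0.05)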
Ratio metrics & the delta method
Click-through rate, conversion rate, revenue per user — anything of the form sum(numerator) / sum(denominator). The catch: if both numerator and denominator vary across users, the variance of the ratio isn't the same as the variance of the per-user mean.
Why this matters
Two ways to compute CTR:
- User-level: compute each user's CTR (clicks/impressions), then average across users.
- Pooled: total clicks across all users / total impressions across all users.
Pooled is usually what stakeholders want ("our CTR is X%"). User-level is what t-tests assume. If a few heavy users dominate impressions, the two estimates can disagree meaningfully. The delta method gives a valid standard error for the pooled ratio:
import numpy as np

def delta_method_ratio_var(num: np.ndarray, den: np.ndarray) -> float:
    """Per-user num and den arrays. Returns variance of sum(num)/sum(den)."""
    n = len(num)
    mean_n, mean_d = np.mean(num), np.mean(den)
    var_n, var_d = np.var(num, ddof=1), np.var(den, ddof=1)
    cov = np.cov(num, den, ddof=1)[0, 1]
    ratio = mean_n / mean_d
    var = (var_n / mean_d**2
           - 2 * ratio * cov / mean_d**2
           + ratio**2 * var_d / mean_d**2) / n
    return var
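A usage sketch with simulated counts (rates made up), contrasting the pooled ratio with the user-level average:

rng = np.random.default_rng(0)
impressions = rng.poisson(20, 10_000) + 1          # per-user impression counts
clicks = rng.binomial(impressions, 0.05)
pooled_ctr = clicks.sum() / impressions.sum()
se = np.sqrt(delta_method_ratio_var(clicks, impressions))
user_level_ctr = np.mean(clicks / impressions)
# The two CTRs diverge when heavy users dominate the denominator;
# the delta-method SE belongs to the pooled estimate.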
Interference & spillover
Beyond clean SUTVA cases, real interference includes:
- Network spillover: a treated user shares treated content with a control user.
- Marketplace interference: treated sellers cannibalize control-seller sales.
- Inventory contention: a treatment that drives more demand pulls supply from control.
- Stack interference: a treatment that increases latency degrades service for everyone, including control.
The strong-answer move: name the interference, name the cluster level that would contain it, name the cost (the power hit from fewer effective units), and pick a design honestly.
Interview probes
Probe 1: "When would you pick a multi-armed bandit over an A/B test?"
Three conditions need to hold: (1) reward feedback is fast (seconds to minutes); (2) the decision is "which option performs best right now" rather than "what's the unbiased lift estimate"; (3) you have many arms or the cost of exploration is high. Canonical fits: ad creative, news ranking, headline testing. Don't use a bandit when you need a defensible point estimate for stakeholder review.
Probe 2: "Explain CUPED in two sentences."
CUPED reduces the variance of an A/B test by regressing the post-period metric on a pre-period covariate (typically the same metric, measured pre-experiment) and analyzing the residual instead of the raw outcome. The result is unbiased and has lower variance when pre and post are correlated — often shrinking required sample size by 30–50%.
Probe 3: "What's a switchback test, and what makes it hard?"
A design where treatment is applied to a whole region for a time window, then switched. Used when individual randomization would create interference (marketplaces, supply-side systems). Hardness: time-of-day confounding, carryover between windows, very few effective units (each region-window is one observation), and the analysis has to account for autocorrelation. Variance is huge — you need many region-windows.
Probe 4: "What's the delta method good for?"
Computing valid standard errors for ratio metrics like pooled CTR or revenue-per-user, where the numerator and denominator both vary across users and are correlated. Without it, naive t-tests on per-user ratios can disagree with the pooled estimate stakeholders actually report — and the disagreement isn't obvious until you investigate.
Probe 5: "How do you handle a SUTVA violation in a marketplace test?"
Cluster-randomize at the level that contains the spillover. For ride-hailing or food delivery: switchback or geo-randomized. For social: friend-cluster randomization. Accept the variance hit — fewer effective units — and budget for a longer test or a larger effect size. The wrong answer is "ignore it and randomize at the user level anyway": with contested supply, treatment gains come partly out of control's pocket, so the user-level lift estimate is biased (typically inflated relative to the true global effect).