Advanced Experimentation
The methods senior-IC JDs name — multi-armed bandits, sequential testing — plus the variance reduction and interference handling that separate strong experimentation programs from the average.
Multi-armed bandits
The Lead-DS JD lists "multi-armed bandits" alongside A/B tests. The bar is: you know when to use one, what algorithm you'd choose, and how to defend the tradeoff against a fixed-traffic A/B test.
What MABs are
A bandit dynamically allocates more traffic to better-performing arms during the test, rather than splitting 50/50 throughout. The result: less "regret" (revenue or conversion lost to the inferior arm) while still learning which arm is best.
When to use one
- Short-lived decisions with fast reward feedback. Headlines, thumbnails, news ranking, ad creative. You care about cumulative reward over the test window, not just inference at the end.
- Many arms. With 10 creative variants, a fixed-split test pays full sample in every arm, including the obvious losers. A bandit prunes losers fast.
- Continuous deployment. No clean "ship/kill" moment — you want the system to keep optimizing.
When NOT to use one
- You need a clean unbiased point estimate for the lift. Bandits trade inferential cleanliness for cumulative reward. The data is no longer i.i.d. across the test.
- Slow reward feedback. If conversion takes 30 days to observe, the bandit can't adapt during the test window; it degenerates into an A/B test.
- The change is structurally large or risky. Bandits aren't a safety mechanism. A canary or staged rollout is.
- Stakeholder distrust. Bandits feel "automatic" to non-DS partners. If you can't explain the algorithm to a PM in 30 seconds, the lift is harder to defend.
Three algorithms to know
| Algorithm | Intuition | Use when |
|---|---|---|
| ε-greedy | Pick the best arm with probability 1−ε, random with probability ε | Stationary world, simple to explain |
| UCB1 | Pick the arm with the highest upper confidence bound on the reward | You want theoretical regret guarantees; reward is bounded |
| Thompson sampling | Sample each arm's reward from a posterior, pick the max | Reward has a clean conjugate prior (Beta-Binomial for CTR); Bayesian framing |
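A minimal Thompson sampler for binary rewards (click / no-click), short enough to explain in a review: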
import numpy as np

class ThompsonBandit:
    def __init__(self, n_arms: int):
        # Beta(1,1) prior per arm — uniform on [0,1]
        self.alpha = np.ones(n_arms)
        self.beta = np.ones(n_arms)

    def pull(self) -> int:
        # Sample from each arm's posterior; pick the best sample
        samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm: int, reward: int):
        # reward in {0, 1}
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
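A usage sketch with made-up true CTRs, to check that the sampler concentrates traffic on the best arm:

# Simulation sketch: the true CTRs are made up for illustration
rng = np.random.default_rng(0)
true_ctr = [0.03, 0.05, 0.04]
bandit = ThompsonBandit(n_arms=3)
for _ in range(10_000):
    arm = bandit.pull()
    bandit.update(arm, reward=int(rng.random() < true_ctr[arm]))
# Posterior means should concentrate on arm 1 (the 5% arm)
print(bandit.alpha / (bandit.alpha + bandit.beta))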
Sequential testing
Classical A/B tests require committing to a sample size up front and not peeking. In practice, stakeholders peek. Sequential testing makes peeking valid.
mSPRT (mixture Sequential Probability Ratio Test)
Used by Optimizely and others. Continuously valid p-values — you can monitor every day, stop when the test crosses a threshold, and the Type I error stays bounded at α.
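A sketch of the always-valid p-value for a normal mean, following the mixture form in Johari et al.; the function name is ours, the variance sigma2 is treated as known, and tau2 is an analyst-chosen mixing variance:

import numpy as np

def msprt_p_value(diffs: np.ndarray, sigma2: float, tau2: float) -> float:
    """Always-valid p-value for H0: mean(diffs) == 0, mixing over N(0, tau2).
    diffs: per-unit treatment-minus-control differences observed so far.
    In production, report the running minimum of this value across looks."""
    n = len(diffs)
    ybar = diffs.mean()
    # Mixture likelihood ratio for a normal mean with known variance sigma2
    lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
        n**2 * tau2 * ybar**2 / (2 * sigma2 * (sigma2 + n * tau2))
    )
    return min(1.0, 1.0 / lam)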
Group sequential boundaries
Pre-specify N interim looks and adjust the per-look α so cumulative Type I error stays at the target. Lan-DeMets spending functions are the standard implementation. Less flexible than mSPRT but easier to audit.
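For intuition, the O'Brien-Fleming-type spending function is a one-liner; real libraries then back out per-look boundaries from the increments numerically. A sketch (function name is ours; scipy assumed available):

from scipy.stats import norm

def obf_alpha_spent(info_fracs, alpha=0.05):
    """Cumulative Type I error spent at each look under the O'Brien-Fleming-like
    Lan-DeMets spending function. info_fracs: fractions of planned sample seen."""
    z = norm.ppf(1 - alpha / 2)
    return [2 * (1 - norm.cdf(z / t**0.5)) for t in info_fracs]

# Looks at 25/50/75/100% of sample: almost no alpha spent early,
# the full 0.05 available by the final look.
# obf_alpha_spent([0.25, 0.5, 0.75, 1.0])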
Bayesian sequential
Report posterior probability of treatment beating control. No sample-size commitment; stop when posterior crosses a threshold. Requires defensible priors, which is the catch.
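The reported quantity is simple to compute; a minimal sketch assuming Bernoulli metrics and uniform Beta(1,1) priors (the prior choice is exactly the contested part):

import numpy as np

def prob_treatment_beats_control(s_t, n_t, s_c, n_c, draws=100_000, seed=0):
    """Monte Carlo estimate of P(p_treatment > p_control).
    s_* = successes, n_* = trials; independent Beta(1,1) priors per arm."""
    rng = np.random.default_rng(seed)
    p_t = rng.beta(1 + s_t, 1 + n_t - s_t, draws)
    p_c = rng.beta(1 + s_c, 1 + n_c - s_c, draws)
    return float((p_t > p_c).mean())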
If an interviewer asks "how do you handle peeking?", "use a sequential test like mSPRT" is the strong answer. The weak answer is "don't peek." The weak answer is also true, but the strong answer signals you've worked at a scale where stakeholders peek anyway.
CUPED variance reduction
CUPED (Controlled-experiment Using Pre-Experiment Data) reduces the variance of your treatment effect estimate by adjusting for a pre-period covariate. Microsoft published it; it's standard at every mature experimentation platform.
The idea
If you knew each user's pre-period metric value, you could subtract theta × (pre − mean(pre)) from their post-period observation, where theta is the OLS slope of post on pre. The resulting adjusted outcomes have lower variance — sometimes 30–50% lower — at zero cost in unbiasedness.
import numpy as np

def cuped_estimate(y_treatment, y_control, x_treatment, x_control):
    """y = post-period metric, x = pre-period metric (per user)."""
    y = np.concatenate([y_treatment, y_control])
    x = np.concatenate([x_treatment, x_control])
    # theta = cov(y, x) / var(x)
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    y_t_adj = y_treatment - theta * (x_treatment - np.mean(x))
    y_c_adj = y_control - theta * (x_control - np.mean(x))
    return np.mean(y_t_adj) - np.mean(y_c_adj), y_t_adj, y_c_adj
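A usage sketch on simulated data with a made-up 0.2 lift; the variance drop comes entirely from the pre/post correlation:

rng = np.random.default_rng(0)
x_t, x_c = rng.normal(10, 3, 5000), rng.normal(10, 3, 5000)   # pre-period
y_t = x_t + rng.normal(0.2, 2, 5000)   # post = pre + noise + 0.2 lift
y_c = x_c + rng.normal(0.0, 2, 5000)
est, y_t_adj, y_c_adj = cuped_estimate(y_t, y_c, x_t, x_c)
# np.var(y_t_adj) is far below np.var(y_t): same estimate, tighter CI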
When it helps most
- Pre-period metric is highly correlated with post-period (true for any "engagement" metric — heavy users stay heavy).
- Users have meaningful pre-period history (new users have noisy or missing pre-period values).
When to be careful
- New-user experiments: no pre-period to adjust on. CUPED degenerates to the unadjusted estimate.
- Composition shift: if treatment changes who shows up, the covariate is no longer "pre-treatment" in the right sense.
Switchback tests
For marketplaces, ride-hailing, food delivery — anywhere the unit of consumption can't be cleanly randomized because supply is shared. Treatment is applied to a whole region for a time window, then switched. Compare metrics during treatment windows vs control windows.
- Pros: handles interference (you don't have control riders and treatment riders competing for the same drivers).
- Cons: huge variance (each "unit" is a region-window, and there are far fewer of those than individual users); carryover needs careful handling (does last hour's pricing affect this hour's demand?); time-of-day confounding is brutal (see the sketch below).
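A deliberately naive window-level sketch (names ours): the unit of analysis is the region-window, and the failure modes listed above (carryover, time-of-day, autocorrelation) are exactly what it ignores:

import numpy as np

def switchback_estimate(window_means: np.ndarray, is_treated: np.ndarray):
    """Difference in means over region-windows, treating each window as one
    i.i.d. unit. Real analyses adjust for time-of-day and autocorrelation
    (e.g., a block bootstrap over days); this sketch deliberately does not."""
    t, c = window_means[is_treated], window_means[~is_treated]
    diff = t.mean() - c.mean()
    se = np.sqrt(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
    return diff, se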
Network & cluster designs
When SUTVA breaks (treatment of one user affects another), cluster-randomize at the level that contains the spillover. Examples:
- Social: randomize at the friend-cluster level via community detection.
- Geographic: randomize at the city or DMA level.
- Marketplace: randomize at the market or time-of-day level (switchback).
The cost: many fewer "units" → variance balloons. The "cluster effective sample size" can be a tiny fraction of the raw user count if within-cluster correlation is high.
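The standard back-of-envelope here is the Kish design effect; a sketch:

def effective_sample_size(n_users: int, avg_cluster_size: float, icc: float) -> float:
    """Kish design effect: n_eff = n / (1 + (m - 1) * rho), with m the average
    cluster size and rho the intraclass correlation."""
    return n_users / (1 + (avg_cluster_size - 1) * icc)

# 1M users in clusters of 1,000 with ICC 0.05 -> roughly 20k effective users
# effective_sample_size(1_000_000, 1_000, 0.05)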
Ratio metrics & the delta method
Click-through rate, conversion rate, revenue per user — anything of the form sum(numerator) / sum(denominator). The catch: if both numerator and denominator vary across users, the variance of the ratio isn't the same as the variance of the per-user mean.
Why this matters
Two ways to compute CTR:
- User-level: compute each user's CTR (clicks/impressions), then average across users.
- Pooled: total clicks across all users / total impressions across all users.
Pooled is usually what stakeholders want ("our CTR is X%"). User-level is what t-tests assume. If a few heavy users dominate impressions, the two estimates can disagree meaningfully. The delta method gives a valid standard error for the pooled ratio:
import numpy as np

def delta_method_ratio_var(num: np.ndarray, den: np.ndarray) -> float:
    """Per-user num and den arrays. Returns variance of sum(num)/sum(den)."""
    n = len(num)
    mean_n, mean_d = np.mean(num), np.mean(den)
    var_n, var_d = np.var(num, ddof=1), np.var(den, ddof=1)
    cov = np.cov(num, den, ddof=1)[0, 1]
    ratio = mean_n / mean_d
    var = (var_n / mean_d**2
           - 2 * ratio * cov / mean_d**2
           + ratio**2 * var_d / mean_d**2) / n
    return var
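A usage sketch with simulated counts (rates made up), contrasting the pooled ratio with the user-level average:

rng = np.random.default_rng(0)
impressions = rng.poisson(20, 10_000) + 1          # per-user impression counts
clicks = rng.binomial(impressions, 0.05)
pooled_ctr = clicks.sum() / impressions.sum()
se = np.sqrt(delta_method_ratio_var(clicks, impressions))
user_level_ctr = np.mean(clicks / impressions)
# The two CTRs diverge when heavy users dominate the denominator;
# the delta-method SE belongs to the pooled estimate.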
Interference & spillover
Beyond clean SUTVA cases, real interference includes:
- Network spillover: a treated user shares treated content with a control user.
- Marketplace interference: treated sellers cannibalize control-seller sales.
- Inventory contention: a treatment that drives more demand pulls supply from control.
- Stack interference: a treatment that increases latency degrades service for everyone, including control.
The strong-answer move: name the interference, name the cluster level that would contain it, name the cost (the power hit from fewer effective units), and pick a design honestly.
Interview probes
Probe 1: "When would you pick a multi-armed bandit over an A/B test?"
Three conditions need to hold: (1) reward feedback is fast (seconds to minutes); (2) the decision is "which option performs best right now" rather than "what's the unbiased lift estimate"; (3) you have many arms or the cost of exploration is high. Canonical fits: ad creative, news ranking, headline testing. Don't use a bandit when you need a defensible point estimate for stakeholder review.
Probe 2: "Explain CUPED in two sentences."
CUPED reduces the variance of an A/B test by regressing the post-period metric on a pre-period covariate (typically the same metric, measured pre-experiment) and analyzing the residual instead of the raw outcome. The result is unbiased and has lower variance when pre and post are correlated — often shrinking required sample size by 30–50%.
Probe 3: "What's a switchback test, and what makes it hard?"
A design where treatment is applied to a whole region for a time window, then switched. Used when individual randomization would create interference (marketplaces, supply-side systems). Hardness: time-of-day confounding, carryover between windows, very few effective units (each region-window is one observation), and the analysis has to account for autocorrelation. Variance is huge — you need many region-windows.
Probe 4: "What's the delta method good for?"
Computing valid standard errors for ratio metrics like pooled CTR or revenue-per-user, where the numerator and denominator both vary across users and are correlated. Without it, naive t-tests on per-user ratios can disagree with the pooled estimate stakeholders actually report — and the disagreement isn't obvious until you investigate.
Probe 5: "How do you handle a SUTVA violation in a marketplace test?"
Cluster-randomize at the level that contains the spillover. For ride-hailing or food delivery: switchback or geo-randomized. For social: friend-cluster randomization. Accept the variance hit — fewer effective units — and budget for a longer test or a larger effect size. The wrong answer is "ignore it and randomize at the user level anyway": with contested supply, treatment gains come partly out of control's pocket, so the user-level lift estimate is biased (typically inflated relative to the true global effect).