Practice Interview Questions
30 questions across 8 categories, calibrated to product/analytics DS loops at AI companies. Read each question, give your version on a 90-second timer, then reveal to compare.
Section A · Background / motivation
Q1. "Walk me through your background."
Show strong answer
"I do [current role]. My unfair advantage in this work is [specific edge]. Most recently I [one project — what changed because of it, measurably]. The reason this role caught me is [specific line from the JD] — that's exactly the work I want to do next. The thing I'd be ramping on is [honest gap], and I have a concrete plan for closing it."
The shape is: edge → one project → bridge to this role → honest gap.
Q2. "Why this company / role?"
Show strong answer
Point at one specific line in the JD or one specific thing the company is doing. Generic "I love AI" gets ignored. "The line about defining the experimentation stack — that's the work I most want to do, because I've spent two years interpreting other people's experiments and I want to design the next one" is hard to fake and easy to verify in the rest of the conversation.
Q3. "Tell me about a time an analysis you did changed a decision."
Show strong answer
STAR format. Situation: the decision on the table, the stakes. Task: the analytical question. Action: what method, why that method, what you almost did instead. Result: the decision that changed, the downstream metric impact. Most candidates skip the "almost did instead" — naming it signals you're aware methods have alternatives.
Q4. "Tell me about an analysis that was wrong, and what you did."
Show strong answer
Have a real one. Not a near-miss. Frame: what you concluded, what you missed, how you discovered it, what you did once you found out, what changed in your process going forward. The last beat is most important — interviewers want to know you have a feedback loop.
Section B · SQL & data
Q5. "Difference between INNER JOIN, LEFT JOIN, FULL JOIN?"
Show strong answer
INNER returns only matched rows. LEFT keeps every row in the left table; unmatched right side is NULL. FULL keeps every row in both, NULLs where no match. The trap interviewers love: WHERE right.col = 'x' on a LEFT JOIN secretly turns it into an inner join because NULLs fail the equality. Move that into the ON clause if you want to preserve the LEFT semantics.
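The trap in miniature, with hypothetical users and orders tables:

    -- Secretly an inner join: the WHERE filter rejects the NULL rows
    -- the LEFT JOIN produced for users with no orders.
    SELECT u.user_id, o.order_id
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.user_id
    WHERE o.status = 'complete';

    -- True left join: the filter lives in the ON clause,
    -- so unmatched users survive with NULL order columns.
    SELECT u.user_id, o.order_id
    FROM users u
    LEFT JOIN orders o
      ON o.user_id = u.user_id
     AND o.status = 'complete';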
Q6. "Compute 7-day rolling DAU in SQL."
Show strong answer
Daily distinct user counts as the base table, then AVG() with a window frame of ROWS BETWEEN 6 PRECEDING AND CURRENT ROW. Use ROWS so the frame is exactly seven rows, and make sure the base table has one row per calendar day (join to a date spine if activity has gaps); otherwise the seven rows silently cover more than seven days. See 03-sql-for-product-analytics.
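A minimal sketch in standard window-function SQL, assuming a hypothetical events table with user_id and event_date:

    WITH daily AS (
      SELECT event_date, COUNT(DISTINCT user_id) AS dau
      FROM events
      GROUP BY event_date
    )
    SELECT
      event_date,
      AVG(dau) OVER (
        ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
      ) AS dau_7d_avg
    FROM daily;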
Q7. "Why does NOT IN sometimes return zero rows?"
Show strong answer
If the subquery returns even one NULL, every NOT IN comparison evaluates to UNKNOWN, which filters all rows. NOT EXISTS handles NULLs correctly. Default to NOT EXISTS for correlated lookups.
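The failure and the fix side by side, assuming a hypothetical churned_users table whose user_id column can contain NULLs:

    -- Returns zero rows if any churned_users.user_id is NULL:
    -- x NOT IN (1, 2, NULL) evaluates to UNKNOWN, never TRUE.
    SELECT *
    FROM users u
    WHERE u.user_id NOT IN (SELECT c.user_id FROM churned_users c);

    -- NULL-safe anti-join:
    SELECT *
    FROM users u
    WHERE NOT EXISTS (
      SELECT 1 FROM churned_users c WHERE c.user_id = u.user_id
    );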
Q8. "How would you compute MAU as a 28-day rolling window per day?"
Show strong answer
The catch: COUNT(DISTINCT) doesn't work in a window frame. Workarounds: (a) precompute first-seen/last-seen per user and aggregate; (b) self-join with a 28-day date window and count distinct user_ids per day; (c) HyperLogLog approximate distinct counts if your dialect supports them (BigQuery, Redshift). Each has tradeoffs; mention all three and pick the simplest that scales.
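Option (b) as a Postgres-flavored sketch (interval syntax varies by dialect), assuming a calendar date-spine table and an events table:

    SELECT
      d.cal_date,
      COUNT(DISTINCT e.user_id) AS mau_28d
    FROM calendar d
    JOIN events e
      ON e.event_date BETWEEN d.cal_date - INTERVAL '27 days' AND d.cal_date
    GROUP BY d.cal_date
    ORDER BY d.cal_date;

27 preceding days plus the current day makes the 28-day window. The join fans each event out to up to 28 rows, which is exactly why option (c) exists at scale.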
Section C · Experimentation
Q9. "What does p < 0.05 mean?"
Show strong answer
If the null hypothesis (no real effect) is true, there's less than a 5% chance of seeing a test statistic at least as extreme as what we observed. Not "5% chance the null is true" — that's a misstatement. A long-run frequency property of the test under the null.
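In symbols, with t_obs the observed statistic (two-sided case):

    p = \Pr(|T| \ge |t_{\mathrm{obs}}| \mid H_0)

The conditioning on H_0 is the point: the probability lives in a world where the null is true, so it cannot be the probability that the null is true.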
Q10. "Walk me through designing an A/B test for moving an upgrade CTA."
Show strong answer
Restate the decision → primary metric (weekly conversion to paid) → guardrails (latency, retention, support) → unit of randomization (user, sticky hash) → power calc and duration → decision rule written down → risks (novelty, SRM). See 04 §design-from-prompt for the full script.
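The power calc in that list reduces to one back-of-envelope formula (two-sided α = 0.05, power 1 − β = 0.8):

    n_{\mathrm{per\,arm}} = 2\,(z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2 / \delta^2 \approx 16\,\sigma^2 / \delta^2

where δ is the minimum detectable effect and σ² the outcome variance (roughly p(1 − p) for a conversion rate p). Duration is then 2n divided by eligible users per day, rounded up to whole weeks to absorb weekly seasonality.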
Q11. "Why is peeking bad?"
Show strong answer
Repeated significance tests on accumulating data inflate Type I error. Ten looks can push effective false-positive rate to ~20%. Fix: commit to fixed sample size, or use a sequential test (mSPRT, group sequential) that's valid under continuous monitoring.
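The back-of-envelope version of the inflation: if the k looks were independent tests at α = 0.05,

    \Pr(\text{any false positive in } k \text{ looks}) = 1 - 0.95^k, \quad k = 10 \Rightarrow \approx 0.40

Interim looks at accumulating data are positively correlated, so the true figure is lower (the classic sequential-analysis calculation puts ten evenly spaced looks near 19%), but that is still roughly four times the nominal rate.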
Q12. "When would you use a multi-armed bandit?"
Show strong answer
Three conditions: (1) fast reward feedback; (2) you care about cumulative reward over the test window, not unbiased lift; (3) many arms. Canonical fits: headlines, thumbnails, ad creative. Not a fit when you need a defensible point estimate for stakeholder review.
Q13. "What's CUPED?"
Show strong answer
Variance reduction by adjusting the post-period outcome with a pre-period covariate (typically the same metric measured pre-experiment). Unbiased; cuts required sample by 30–50% when pre and post are correlated. Standard at mature experimentation platforms.
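The mechanics in two lines, with X the pre-period covariate and Y the post-period outcome:

    Y^{\mathrm{cuped}} = Y - \theta\,(X - \bar{X}), \quad \theta = \mathrm{cov}(X, Y) / \mathrm{var}(X)
    \mathrm{var}(Y^{\mathrm{cuped}}) = (1 - \rho^2)\,\mathrm{var}(Y)

Since required sample size scales with variance, a pre/post correlation of ρ = 0.7 cuts it roughly in half; that is where the 30–50% figure comes from.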
Q14. "What's a SUTVA violation?"
Show strong answer
Stable Unit Treatment Value Assumption: one unit's outcome shouldn't depend on others' assignment. Marketplaces, social, supply-side constrained systems all violate it. Fix: cluster-randomize at the level that contains the spillover; accept higher variance.
Q15. "Your test ran 2 weeks, primary +3%, p=0.04. Ship?"
Show strong answer
"Probably, but I'd check: SRM clean? Pre-period balance OK? Any guardrail moved? Is the +3% distributed or concentrated in a slice? If all clean, ship and monitor in the post-period for novelty fade." Notice: the reasoning is the answer.
Section D · Causal inference
Q16. "When DiD over A/B?"
Show strong answer
I wouldn't, if I could randomize. DiD is for infeasible-randomization cases (regulatory rollouts, geo launches) where I have pre-period data on both treated and untreated. Key assumption: parallel trends. Defend it with pre-period plots and event-study tests.
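The estimator itself, in the 2×2 case:

    \widehat{\mathrm{DiD}} = (\bar{Y}^{\mathrm{treat}}_{\mathrm{post}} - \bar{Y}^{\mathrm{treat}}_{\mathrm{pre}}) - (\bar{Y}^{\mathrm{ctrl}}_{\mathrm{post}} - \bar{Y}^{\mathrm{ctrl}}_{\mathrm{pre}})

Parallel trends is what licenses reading the control group's pre-to-post change as the treated group's counterfactual change.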
Q17. "Key assumption of propensity scoring?"
Show strong answer
Ignorability — conditional on observed covariates, treatment is as good as random. Equivalently: no unobserved confounders. Strong assumption. Best when treatment assignment is well-understood and drivers are measurable.
Q18. "What's a weak instrument?"
Show strong answer
An instrument only weakly correlated with treatment (first-stage F < 10, by the classic rule of thumb). Weak instruments bias the IV estimate toward OLS and understate standard errors. Always report the first-stage F; if it's weak, don't trust the IV estimate.
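One way to see the fragility, in the single-instrument, single-treatment case:

    \hat{\beta}_{IV} = \mathrm{cov}(Z, Y) / \mathrm{cov}(Z, X)

A weak instrument means the denominator is near zero, so ordinary sampling noise in cov(Z, X) swings the estimate wildly, and any small direct correlation between Z and Y gets amplified.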
Q19. "Marketing campaign ran in some cities, not others. How do you estimate causal effect?"
Show strong answer
DiD with parallel-trends checks is the standard play if cities have pre-period data. Synthetic control if it's one treated city with many candidate controls. Worry about spillover (cities aren't independent), and about selection (the chosen treated cities aren't random — try to recover ATT, not ATE).
Section E · Product sense
Q20. "Pick a north star for [the company's product]."
Show strong answer
Defend on four axes: ties to user value, movable on a useful horizon, sensitive, hard to game. For a founding product-DS role: "weekly active creators publishing ≥ 1 video." For an enterprise LLM API: "weekly active accounts with > N API calls" or "net revenue retention by cohort." Pair with leading + lagging indicators and a guardrail set.
Q21. "Metric X dropped 5%. What do you do?"
Show strong answer
Six-step protocol. (1) Is it real — check instrumentation. (2) Same denominator? (3) Slice by platform/source/country. (4) Correlate with releases. (5) Form hypothesis. (6) Validate with focused query or experiment. Update stakeholders within 24h with what we know / suspect / are doing — even if incomplete.
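Step (3) as a hypothetical Postgres-flavored slice query against an events table (table and column names are illustrative):

    SELECT
      platform,
      COUNT(DISTINCT CASE WHEN event_date >= CURRENT_DATE - 7
                          THEN user_id END) AS actives_this_week,
      COUNT(DISTINCT CASE WHEN event_date >= CURRENT_DATE - 14
                           AND event_date <  CURRENT_DATE - 7
                          THEN user_id END) AS actives_prior_week
    FROM events
    WHERE event_date >= CURRENT_DATE - 14
    GROUP BY platform
    ORDER BY actives_prior_week DESC;

Repeat with source and country in the GROUP BY; the slice whose ratio moves most is the lead.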
Q22. "How do you measure success for an AI assistant feature?"
Show strong answer
Layered. Adoption: weekly invokers. Engagement: invocations per active. Quality: acceptance rate, refusal/hallucination rate. Business impact: lift on the downstream metric the feature was supposed to improve. Guardrails: latency p95, cost per invocation. The senior signal: naming the cost economics.
Q23. "Funnel-step conversion vs overall conversion — pick one."
Show strong answer
"Both, for different audiences. Product team for step-conversion (what friction sits between steps?), exec for overall conversion (what's our funnel-top-to-bottom?). The two can move in opposite directions if funnel-top composition shifts, and always say which one you're showing."
Section F · Predictive modeling
Q24. "How would you forecast monthly revenue?"
Show strong answer
Start with a seasonal-naïve baseline. Layer on a model with exogenous features (gradient boosting on lag features plus pipeline and seasonality signals). Validate walk-forward. Report a point estimate plus an interval, decomposed into committed vs new-pipeline revenue because the two have different uncertainty profiles. Only claim to beat seasonal-naïve if you can show it on the holdout.
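The seasonal-naïve baseline is one window function, assuming a hypothetical monthly_revenue table:

    SELECT
      month,
      revenue,
      LAG(revenue, 12) OVER (ORDER BY month) AS seasonal_naive_forecast
    FROM monthly_revenue
    ORDER BY month;

Same month last year as the forecast. Anything fancier has to beat this on the walk-forward holdout before it earns its complexity.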
Q25. "Why calibrate a classifier?"
Show strong answer
AUC measures ranking, not whether a score of 0.3 actually means a 30% chance. Business stakeholders read scores as probabilities, and decisions depend on calibrated thresholds. Fit isotonic regression or Platt scaling on a holdout set, and plot a reliability diagram to defend the calibration.
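The reliability diagram is just a grouped aggregate, assuming a hypothetical scores table with predicted probability p_hat and binary outcome y:

    SELECT
      FLOOR(p_hat * 10) / 10  AS score_bin,       -- decile of predicted probability
      AVG(p_hat)              AS mean_predicted,
      AVG(y)                  AS observed_rate,   -- y is 0/1
      COUNT(*)                AS n
    FROM scores
    GROUP BY 1
    ORDER BY 1;

Calibrated means mean_predicted tracks observed_rate in every bin; systematic gaps are what isotonic or Platt scaling corrects.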
Q26. "Walk me through sizing a new product opportunity."
Show strong answer
Addressable universe → per-unit revenue estimate → realistic adoption curve → low/mid/high bounds → sensitivity analysis naming which assumption, if 20% off, flips the recommendation. Deliver as a one-pager with explicit assumptions, not a single number.
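The shape of the arithmetic, with purely illustrative numbers:

    40,000 addressable accounts × 5% year-1 adoption × $10k ACV ≈ $20M
    adoption low/high of 2% / 8%  →  $8M / $32M

Adoption is the assumption that flips the recommendation here, so it's the one that gets the sensitivity range.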
Section G · Leadership (analytics leadership)
Q27. "How do you set the technical bar for a team?"
Show strong answer
Reusable rubric for 'done': question stated, methods defended, uncertainty quantified, recommendation explicit, limitations self-called. Review ritual that enforces it. Be the most senior reader on every important piece for 6 months. After that it becomes culture.
Q28. "Tell me about a time you killed work."
Show strong answer
Have a story ready. Format: what we were building, why it became clear it wouldn't ship usefully, what we documented from the partial work, how we reallocated. The hard part is the stakeholder conversation — lead with what we learned, then the kill, then a smaller follow-up.
Section H · Curveballs
Q29. "Two dashboards report MAU and disagree. What do you do?"
Show strong answer
Trace both definitions, identify the divergence, pick the right one, kill the other, document why. Long-term fix: metric layer (LookML, dbt semantic models, Cube). Communicate the discrepancy and the resolution to stakeholders proactively.
Q30. "When wouldn't you run an A/B test?"
Show strong answer
Five cases. (1) Reversible change with bounded downside: just ship. (2) Insufficient traffic for adequate power. (3) The unit you'd randomize would contaminate its neighbors (use a switchback or quasi-experiment). (4) The change is required by compliance or contract, so there's no kill option for a test to exercise. (5) The metric is too long-horizon to measure in the test window: use a leading indicator and triangulate.
Drill protocol
Enable drill mode. Read each question. Set a 90-second timer. Speak your answer out loud — out loud, not in your head. Reveal. Compare. Mark "practiced." Aim for 20+ of the 30 before any onsite. Reread the strong-answer phrasing for the ones you stumbled on.