Practice Interview Questions
30 questions across 8 categories, calibrated to product/analytics DS loops at AI companies. Read each question, give your version on a 90-second timer, then reveal to compare.
Section A · Background / motivation
Q1. "Walk me through your background."
Show strong answer
"I do [current role]. My unfair advantage in this work is [specific edge]. Most recently I [one project — what changed because of it, measurably]. The reason this role caught me is [specific line from the JD] — that's exactly the work I want to do next. The thing I'd be ramping on is [honest gap], and I have a concrete plan for closing it."
The shape is: edge → one project → bridge to this role → honest gap.
Q2. "Why this company / role?"
Show strong answer
Point at one specific line in the JD or one specific thing the company is doing. Generic "I love AI" gets ignored. "The line about defining the experimentation stack — that's the work I most want to do, because I've spent two years interpreting other people's experiments and I want to design the next one" is hard to fake and easy to verify in the rest of the conversation.
Q3. "Tell me about a time an analysis you did changed a decision."
Show strong answer
STAR format. Situation: the decision on the table, the stakes. Task: the analytical question. Action: what method, why that method, what you almost did instead. Result: the decision that changed, the downstream metric impact. Most candidates skip the "almost did instead" — naming it signals you're aware methods have alternatives.
Q4. "Tell me about an analysis that was wrong, and what you did."
Show strong answer
Have a real one. Not a near-miss. Frame: what you concluded, what you missed, how you discovered it, what you did once you found out, what changed in your process going forward. The last beat is most important — interviewers want to know you have a feedback loop.
Section B · SQL & data
Q5. "Difference between INNER JOIN, LEFT JOIN, FULL JOIN?"
Show strong answer
INNER returns only matched rows. LEFT keeps every row in the left table; unmatched right side is NULL. FULL keeps every row in both, NULLs where no match. The trap interviewers love: WHERE right.col = 'x' on a LEFT JOIN secretly turns it into an inner join because NULLs fail the equality. Move that into the ON clause if you want to preserve the LEFT semantics.
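The trap in miniature, with hypothetical users and orders tables:

    -- Secretly an inner join: the WHERE filter rejects the NULL rows
    -- the LEFT JOIN produced for users with no orders.
    SELECT u.user_id, o.order_id
    FROM users u
    LEFT JOIN orders o ON o.user_id = u.user_id
    WHERE o.status = 'complete';

    -- True left join: the filter lives in the ON clause,
    -- so unmatched users survive with NULL order columns.
    SELECT u.user_id, o.order_id
    FROM users u
    LEFT JOIN orders o
      ON o.user_id = u.user_id
     AND o.status = 'complete';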
Q6. "Compute 7-day rolling DAU in SQL."
Show strong answer
Daily distinct user counts as the base table, then AVG() with a window frame of ROWS BETWEEN 6 PRECEDING AND CURRENT ROW. Use ROWS so the frame is exactly seven rows, and make sure the base table has one row per calendar day (join to a date spine if activity has gaps); otherwise the seven rows silently cover more than seven days. See 03-sql-for-product-analytics.
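A minimal sketch in standard window-function SQL, assuming a hypothetical events table with user_id and event_date:

    WITH daily AS (
      SELECT event_date, COUNT(DISTINCT user_id) AS dau
      FROM events
      GROUP BY event_date
    )
    SELECT
      event_date,
      AVG(dau) OVER (
        ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
      ) AS dau_7d_avg
    FROM daily;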
Q7. "Why does NOT IN sometimes return zero rows?"
Show strong answer
If the subquery returns even one NULL, every NOT IN comparison evaluates to UNKNOWN, which filters all rows. NOT EXISTS handles NULLs correctly. Default to NOT EXISTS for correlated lookups.
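The failure and the fix side by side, assuming a hypothetical churned_users table whose user_id column can contain NULLs:

    -- Returns zero rows if any churned_users.user_id is NULL:
    -- x NOT IN (1, 2, NULL) evaluates to UNKNOWN, never TRUE.
    SELECT *
    FROM users u
    WHERE u.user_id NOT IN (SELECT c.user_id FROM churned_users c);

    -- NULL-safe anti-join:
    SELECT *
    FROM users u
    WHERE NOT EXISTS (
      SELECT 1 FROM churned_users c WHERE c.user_id = u.user_id
    );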
Q8. "How would you compute MAU as a 28-day rolling window per day?"
Show strong answer
The catch: COUNT(DISTINCT) doesn't work in a window frame. Workarounds: (a) precompute first-seen/last-seen per user and aggregate; (b) self-join with a 28-day date window and count distinct user_ids per day; (c) HyperLogLog approximate distinct counts if your dialect supports them (BigQuery, Redshift). Each has tradeoffs; mention all three and pick the simplest that scales.
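Option (b) as a Postgres-flavored sketch (interval syntax varies by dialect), assuming a calendar date-spine table and an events table:

    SELECT
      d.cal_date,
      COUNT(DISTINCT e.user_id) AS mau_28d
    FROM calendar d
    JOIN events e
      ON e.event_date BETWEEN d.cal_date - INTERVAL '27 days' AND d.cal_date
    GROUP BY d.cal_date
    ORDER BY d.cal_date;

27 preceding days plus the current day makes the 28-day window. The join fans each event out to up to 28 rows, which is exactly why option (c) exists at scale.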
Section C · Experimentation
Q9. "What does p < 0.05 mean?"
Show strong answer
If the null hypothesis (no real effect) is true, there's less than a 5% chance of seeing a test statistic at least as extreme as what we observed. Not "5% chance the null is true" — that's a misstatement. A long-run frequency property of the test under the null.
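In symbols, with t_obs the observed statistic (two-sided case):

    p = \Pr(|T| \ge |t_{\mathrm{obs}}| \mid H_0)

The conditioning on H_0 is the point: the probability lives in a world where the null is true, so it cannot be the probability that the null is true.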
Q10. "Walk me through designing an A/B test for moving an upgrade CTA."
Show strong answer
Restate the decision → primary metric (weekly conversion to paid) → guardrails (latency, retention, support) → unit of randomization (user, sticky hash) → power calc and duration → decision rule written down → risks (novelty, SRM). See 04 §design-from-prompt for the full script.
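The power calc in that list reduces to one back-of-envelope formula (two-sided α = 0.05, power 1 − β = 0.8):

    n_{\mathrm{per\,arm}} = 2\,(z_{1-\alpha/2} + z_{1-\beta})^2 \sigma^2 / \delta^2 \approx 16\,\sigma^2 / \delta^2

where δ is the minimum detectable effect and σ² the outcome variance (roughly p(1 − p) for a conversion rate p). Duration is then 2n divided by eligible users per day, rounded up to whole weeks to absorb weekly seasonality.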
Q11. "Why is peeking bad?"
Show strong answer
Repeated significance tests on accumulating data inflate Type I error. Ten looks can push effective false-positive rate to ~20%. Fix: commit to fixed sample size, or use a sequential test (mSPRT, group sequential) that's valid under continuous monitoring.
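The back-of-envelope version of the inflation: if the k looks were independent tests at α = 0.05,

    \Pr(\text{any false positive in } k \text{ looks}) = 1 - 0.95^k, \quad k = 10 \Rightarrow \approx 0.40

Interim looks at accumulating data are positively correlated, so the true figure is lower (the classic sequential-analysis calculation puts ten evenly spaced looks near 19%), but that is still roughly four times the nominal rate.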
Q12. "When would you use a multi-armed bandit?"
Show strong answer
Three conditions: (1) fast reward feedback; (2) you care about cumulative reward over the test window, not unbiased lift; (3) many arms. Canonical fits: headlines, thumbnails, ad creative. Not a fit when you need a defensible point estimate for stakeholder review.
Q13. "What's CUPED?"
Show strong answer
Variance reduction by adjusting the post-period outcome with a pre-period covariate (typically the same metric measured pre-experiment). Unbiased; cuts required sample by 30–50% when pre and post are correlated. Standard at mature experimentation platforms.
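The mechanics in two lines, with X the pre-period covariate and Y the post-period outcome:

    Y^{\mathrm{cuped}} = Y - \theta\,(X - \bar{X}), \quad \theta = \mathrm{cov}(X, Y) / \mathrm{var}(X)
    \mathrm{var}(Y^{\mathrm{cuped}}) = (1 - \rho^2)\,\mathrm{var}(Y)

Since required sample size scales with variance, a pre/post correlation of ρ = 0.7 cuts it roughly in half; that is where the 30–50% figure comes from.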
Q14. "What's a SUTVA violation?"
Show strong answer
Stable Unit Treatment Value Assumption: one unit's outcome shouldn't depend on others' assignment. Marketplaces, social, supply-side constrained systems all violate it. Fix: cluster-randomize at the level that contains the spillover; accept higher variance.
Q15. "Your test ran 2 weeks, primary +3%, p=0.04. Ship?"
Show strong answer
"Probably, but I'd check: SRM clean? Pre-period balance OK? Any guardrail moved? Is the +3% distributed or concentrated in a slice? If all clean, ship and monitor in the post-period for novelty fade." Notice: the reasoning is the answer.
Section D · Causal inference
Q16. "When DiD over A/B?"
Show strong answer
I wouldn't, if I could randomize. DiD is for infeasible-randomization cases (regulatory rollouts, geo launches) where I have pre-period data on both treated and untreated. Key assumption: parallel trends. Defend it with pre-period plots and event-study tests.
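The estimator itself, in the 2×2 case:

    \widehat{\mathrm{DiD}} = (\bar{Y}^{\mathrm{treat}}_{\mathrm{post}} - \bar{Y}^{\mathrm{treat}}_{\mathrm{pre}}) - (\bar{Y}^{\mathrm{ctrl}}_{\mathrm{post}} - \bar{Y}^{\mathrm{ctrl}}_{\mathrm{pre}})

Parallel trends is what licenses reading the control group's pre-to-post change as the treated group's counterfactual change.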
Q17. "Key assumption of propensity scoring?"
Show strong answer
Ignorability — conditional on observed covariates, treatment is as good as random. Equivalently: no unobserved confounders. Strong assumption. Best when treatment assignment is well-understood and drivers are measurable.
Q18. "What's a weak instrument?"
Show strong answer
An instrument only weakly correlated with treatment (first-stage F < 10, by the classic rule of thumb). Weak instruments bias the IV estimate toward OLS and understate standard errors. Always report the first-stage F; if it's weak, don't trust the IV estimate.
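One way to see the fragility, in the single-instrument, single-treatment case:

    \hat{\beta}_{IV} = \mathrm{cov}(Z, Y) / \mathrm{cov}(Z, X)

A weak instrument means the denominator is near zero, so ordinary sampling noise in cov(Z, X) swings the estimate wildly, and any small direct correlation between Z and Y gets amplified.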
Q19. "Marketing campaign ran in some cities, not others. How do you estimate causal effect?"
Show strong answer
DiD with parallel-trends checks is the standard play if cities have pre-period data. Synthetic control if it's one treated city with many candidate controls. Worry about spillover (cities aren't independent), and about selection (the chosen treated cities aren't random — try to recover ATT, not ATE).
Section E · Product sense
Q20. "Pick a north star for [the company's product]."
Show strong answer
Defend on four axes: ties to user value, movable on a useful horizon, sensitive, hard to game. For a founding product-DS role: "weekly active creators publishing ≥ 1 video." For an enterprise LLM API: "weekly active accounts with > N API calls" or "net revenue retention by cohort." Pair with leading + lagging indicators and a guardrail set.
Q21. "Metric X dropped 5%. What do you do?"
Show strong answer
Six-step protocol. (1) Is it real — check instrumentation. (2) Same denominator? (3) Slice by platform/source/country. (4) Correlate with releases. (5) Form hypothesis. (6) Validate with focused query or experiment. Update stakeholders within 24h with what we know / suspect / are doing — even if incomplete.
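Step (3) as a hypothetical Postgres-flavored slice query against an events table (table and column names are illustrative):

    SELECT
      platform,
      COUNT(DISTINCT CASE WHEN event_date >= CURRENT_DATE - 7
                          THEN user_id END) AS actives_this_week,
      COUNT(DISTINCT CASE WHEN event_date >= CURRENT_DATE - 14
                           AND event_date <  CURRENT_DATE - 7
                          THEN user_id END) AS actives_prior_week
    FROM events
    WHERE event_date >= CURRENT_DATE - 14
    GROUP BY platform
    ORDER BY actives_prior_week DESC;

Repeat with source and country in the GROUP BY; the slice whose ratio moves most is the lead.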
Q22. "How do you measure success for an AI assistant feature?"
Show strong answer
Layered. Adoption: weekly invokers. Engagement: invocations per active. Quality: acceptance rate, refusal/hallucination rate. Business impact: lift on the downstream metric the feature was supposed to improve. Guardrails: latency p95, cost per invocation. The senior signal: naming the cost economics.
Q23. "Funnel-step conversion vs overall conversion — pick one."
Show strong answer
"Both, for different audiences. Product team for step-conversion (what friction sits between steps?), exec for overall conversion (what's our funnel-top-to-bottom?). The two can move in opposite directions if funnel-top composition shifts, and always say which one you're showing."
Section F · Predictive modeling
Q24. "How would you forecast monthly revenue?"
Show strong answer
Start with a seasonal-naïve baseline. Layer on a model with exogenous features (gradient boosting on lag features plus pipeline and seasonality signals). Validate walk-forward. Report a point estimate plus an interval, decomposed into committed vs new-pipeline revenue because the two have different uncertainty profiles. Only claim to beat seasonal-naïve if you can show it on the holdout.
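The seasonal-naïve baseline is one window function, assuming a hypothetical monthly_revenue table:

    SELECT
      month,
      revenue,
      LAG(revenue, 12) OVER (ORDER BY month) AS seasonal_naive_forecast
    FROM monthly_revenue
    ORDER BY month;

Same month last year as the forecast. Anything fancier has to beat this on the walk-forward holdout before it earns its complexity.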
Q25. "Why calibrate a classifier?"
Show strong answer
AUC measures ranking, not whether a score of 0.3 actually means a 30% chance. Business stakeholders read scores as probabilities, and decisions depend on calibrated thresholds. Fit isotonic regression or Platt scaling on a holdout set, and plot a reliability diagram to defend the calibration.
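The reliability diagram is just a grouped aggregate, assuming a hypothetical scores table with predicted probability p_hat and binary outcome y:

    SELECT
      FLOOR(p_hat * 10) / 10  AS score_bin,       -- decile of predicted probability
      AVG(p_hat)              AS mean_predicted,
      AVG(y)                  AS observed_rate,   -- y is 0/1
      COUNT(*)                AS n
    FROM scores
    GROUP BY 1
    ORDER BY 1;

Calibrated means mean_predicted tracks observed_rate in every bin; systematic gaps are what isotonic or Platt scaling corrects.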
Q26. "Walk me through sizing a new product opportunity."
Show strong answer
Addressable universe → per-unit revenue estimate → realistic adoption curve → low/mid/high bounds → sensitivity analysis naming which assumption, if 20% off, flips the recommendation. Deliver as a one-pager with explicit assumptions, not a single number.
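The shape of the arithmetic, with purely illustrative numbers:

    40,000 addressable accounts × 5% year-1 adoption × $10k ACV ≈ $20M
    adoption low/high of 2% / 8%  →  $8M / $32M

Adoption is the assumption that flips the recommendation here, so it's the one that gets the sensitivity range.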
Section G · Leadership (analytics leadership)
Q27. "How do you set the technical bar for a team?"
Show strong answer
Reusable rubric for 'done': question stated, methods defended, uncertainty quantified, recommendation explicit, limitations self-called. Review ritual that enforces it. Be the most senior reader on every important piece for 6 months. After that it becomes culture.
Q28. "Tell me about a time you killed work."
Show strong answer
Have a story ready. Format: what we were building, why it became clear it wouldn't ship usefully, what we documented from the partial work, how we reallocated. The hard part is the stakeholder conversation — lead with what we learned, then the kill, then a smaller follow-up.
Section H · Curveballs
Q29. "Two dashboards report MAU and disagree. What do you do?"
Show strong answer
Trace both definitions, identify the divergence, pick the right one, kill the other, document why. Long-term fix: metric layer (LookML, dbt semantic models, Cube). Communicate the discrepancy and the resolution to stakeholders proactively.
Q30. "When wouldn't you run an A/B test?"
Show strong answer
Five cases. (1) Reversible change with bounded downside: just ship. (2) Insufficient traffic for adequate power. (3) The unit you'd randomize would contaminate its neighbors (use a switchback or quasi-experiment). (4) The change is required by compliance or contract, so there's no kill option for a test to exercise. (5) The metric is too long-horizon to measure in the test window: use a leading indicator and triangulate.
Drill protocol
Enable drill mode. Read each question. Set a 90-second timer. Speak your answer out loud — out loud, not in your head. Reveal. Compare. Mark "practiced." Aim for 20+ of the 30 before any onsite. Reread the strong-answer phrasing for the ones you stumbled on.