Drill — Answers Hidden by Default

Practice Interview Questions

30 questions across 8 categories, calibrated to full-stack / applied DS loops at fraud/identity and multimodal-sensor AI companies. 90 seconds per answer.


Section A · Background / motivation

Q1. "Walk me through your background."

Show answer

Edge → one specific project → bridge to this role anchored to a specific JD line → honest gap with a closing plan. See 02-positioning-from-scratch.

Q2. "Tell me about a time you owned a model end-to-end."

Show answer

Frame: business question, framing decision (target, unit, eval criterion), data acquisition, features (lead with the domain-insight feature), model, evaluation (calibration, lift, business utility), production (where, how monitored), post-launch (what surprised you, what you'd do differently). Lead with the feature insight — that's the fraud-domain-flavored answer.

Q3. "Why this role / company?"

Show answer

Point at one specific line in the JD. For a fraud/identity Staff-DS role: the end-to-end ownership + domain-insight-driven approach. For a multimodal-AI role: the iteration speed + multimodal sensor work. Be specific or it sounds generic.

Q4. "Tell me about a model that was wrong."

Show answer

Have a real one. Frame: what shipped, what was wrong, how you discovered it, what you did, what changed in your process going forward. The last beat is the differentiator.

Section B · ML & evaluation

Q5. "When would you use logistic regression over gradient boosting?"

Show answer

(1) Interpretability requirement (regulators, fair-lending review). (2) Small sample (boosting overfits with a few hundred rows). (3) Genuinely linear/additive relationship after good features. (4) Extreme latency constraint where tree-ensemble inference is too slow. Otherwise default to gradient boosting on tabular data.

Q6. "Why calibrate a classifier?"

Show answer

AUC measures ranking only. If downstream consumers use the probability for math (combining with cost, applying a threshold), uncalibrated scores produce silently wrong decisions. Isotonic on a held-out set. Reliability diagram as the diagnostic.
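
A minimal sketch of that recipe, assuming raw_scores_cal and y_cal are the uncalibrated scores and labels on a held-out calibration split (in practice, diagnose on a split separate from the one used to fit):

```python
from sklearn.calibration import calibration_curve
from sklearn.isotonic import IsotonicRegression

# Fit isotonic regression on held-out scores (raw_scores_cal, y_cal are assumed arrays).
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_scores_cal, y_cal)
calibrated = iso.predict(raw_scores_cal)

# Reliability diagram data: mean predicted probability vs observed positive rate per bin.
prob_true, prob_pred = calibration_curve(y_cal, calibrated, n_bins=10)
for p_hat, p_obs in zip(prob_pred, prob_true):
    print(f"predicted {p_hat:.2f} -> observed {p_obs:.2f}")
```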

Q7. "Random k-fold on a fraud dataset — what goes wrong?"

Show answer

Two leaks. Temporal: fraud patterns evolve, random folds let future patterns inform training. Use TimeSeriesSplit. Identity: same entity appears in many rows; random folds let the model memorize entities. Use GroupKFold. Combine both for fraud-style data.
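
Both splitters in sklearn terms, assuming the rows are sorted by event time and entity_id is the grouping key (X, y, df are placeholders):

```python
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

# Temporal: earlier folds train, later folds validate -- no future rows leak backward.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    ...

# Identity: all rows for one entity land on the same side of the split.
gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=df["entity_id"]):
    ...
```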

Q8. "How do you pick a threshold for binary classification?"

Show answer

Depends on operational constraint or cost asymmetry. Options: fixed-capacity (review bandwidth), fixed-precision (FP tolerance), fixed-FPR, or cost-optimal (dollarized FP/FN). Default 0.5 has no operational meaning.
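
Two of those options as code, assuming scores and y come from a validation set and k is the daily review capacity (illustrative numbers):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Fixed capacity: the review team can handle roughly k cases per scoring window.
k = 500  # illustrative
capacity_threshold = np.sort(scores)[-k]

# Fixed precision: lowest threshold that still meets the false-positive tolerance.
precision, recall, thresholds = precision_recall_curve(y, scores)
target_precision = 0.90
meets = precision[:-1] >= target_precision          # precision has one more entry than thresholds
precision_threshold = thresholds[meets].min() if meets.any() else thresholds.max()
```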

Section C · Features

Q9. "Give me an example of an inventive feature you've built."

Show answer

Have one ready. Frame: the domain observation, the feature, the lift, the leakage check. Fraud-domain-flavored answers lead with the observation, not the lift number.

Q10. "How do you check for leakage?"

Show answer

For each feature, ask "what data was available at prediction time?" — trace the logic. Watch: temporal (features looking forward), target (features that depend on label), group (same entity in train + validation). Empirical: implausibly large single-feature lift is almost always leakage.
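
A quick empirical screen for the last point, assuming X is a DataFrame of numeric features and y the label array:

```python
from sklearn.metrics import roc_auc_score

# Standalone AUC per feature; anything implausibly high gets a manual
# "was this available at prediction time?" trace.
suspicious = {}
for col in X.columns:
    vals = X[col].fillna(X[col].median()).rank()    # rank: scale- and direction-agnostic
    auc = roc_auc_score(y, vals)
    auc = max(auc, 1 - auc)
    if auc > 0.90:                                  # cutoff is a judgment call
        suspicious[col] = auc

print(sorted(suspicious.items(), key=lambda kv: -kv[1]))
```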

Q11. "Design features for synthetic-identity detection."

Show answer

Layered. Entity cardinality (distinct names per SSN, etc.); velocity (apps per SSN per 24h); graph proximity (distance to known fraud in shared-attribute graph); consistency (age vs SSN issuance year, address-type sanity); behavioral (typing patterns, paste patterns on the application form).
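
A sketch of the first two layers in pandas, assuming an apps frame with ssn, name, and a datetime app_time column (point-in-time safe because every feature only looks backward):

```python
import pandas as pd

apps = apps.sort_values("app_time").reset_index(drop=True)
apps["one"] = 1.0  # helper for windowed counts

# Entity cardinality: distinct names seen for this SSN up to and including this application.
apps["names_per_ssn"] = (
    apps.groupby("ssn")["name"].transform(lambda s: (~s.duplicated()).cumsum())
)

# Velocity: prior applications on this SSN in the trailing 24 hours.
apps["apps_per_ssn_24h"] = (
    apps.groupby("ssn", group_keys=False)[["app_time", "one"]]
        .apply(lambda g: g.rolling("24h", on="app_time")["one"].sum() - 1)
)
```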

Q12. "When would you use a feature store?"

Show answer

Point-in-time correctness across many models; shared features across models with no drift between them; real-time serving where features must be precomputed. For a single batch model, a well-structured dbt pipeline plus a serving query is enough.

Section D · Imbalanced data & fraud

Q13. "Class imbalance is 99/1. What do you do?"

Show answer

Class weights, PR-curve + lift-at-K reporting, threshold tuning to the operational constraint. SMOTE only if simpler approaches don't clear the bar, and with leakage care (synthesize inside CV) and mixed-type caveats (use SMOTENC).
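
The "synthesize inside CV" point, sketched with imblearn's pipeline so oversampling only ever sees training folds (cat_idx marks categorical column positions; class weights alone are the simpler first try):

```python
from imblearn.over_sampling import SMOTENC
from imblearn.pipeline import Pipeline
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# SMOTENC sits inside the pipeline, so synthetic rows are generated from training folds only.
pipe = Pipeline([
    ("smote", SMOTENC(categorical_features=cat_idx, random_state=0)),
    ("clf", HistGradientBoostingClassifier(random_state=0)),
])
scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(n_splits=5), scoring="average_precision")
```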

Q14. "Why not 'accuracy' for fraud models?"

Show answer

With 1% positive rate, predicting all-negative gets 99% accuracy. Useless. The decision-relevant metrics are PR-curve, lift at top K%, recall at fixed precision (or vice versa), and cost-weighted expected loss if you can dollarize.
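
Lift at top K% is simple enough to carry in your head; a minimal version, with y_true and scores assumed to be validation arrays:

```python
import numpy as np

def lift_at_k(y_true: np.ndarray, scores: np.ndarray, k: float = 0.01) -> float:
    """Fraud rate in the top-k fraction of scores, divided by the base rate."""
    n_top = max(1, int(len(scores) * k))
    top_idx = np.argsort(scores)[-n_top:]
    return float(y_true[top_idx].mean() / y_true.mean())
```

lift_at_k(y_val, model_scores, k=0.01) reads as "the top 1% of scores concentrates this many times the base fraud rate."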

Q15. "Labels are delayed by 60 days. How does that affect training?"

Show answer

Truncated labels — recent applications might still become fraud. Options: exclude the most recent 60 days from training, or use survival-style modeling with censoring. The first is the simpler default; the second is the staff-bar answer when label maturity is highly variable.
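
The simple default as a one-liner, assuming df has a datetime app_time column and labels mature in 60 days:

```python
import pandas as pd

cutoff = df["app_time"].max() - pd.Timedelta(days=60)
train_df = df[df["app_time"] <= cutoff]   # rows newer than the cutoff may still become fraud
```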

Q16. "Why keep a random-review fraction?"

Show answer

To detect novel fraud the model hasn't learned. Without it, you only label what the model flagged — and you miss new patterns entirely. A small (1–5%) random sample of low-risk apps goes to manual review purely as exploration.

Section E · Prompts & signals (multimodal-AI)

Q17. "How do you evaluate a prompt rigorously?"

Show answer

Build an eval set (20–50 labeled examples) before iterating. Score with programmatic checks for structured output, LLM-as-judge calibrated against humans for open-ended. Track per-example deltas, not just aggregates. One change at a time.
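
A minimal harness sketch; call_model, prompt_v1/prompt_v2, and the eval_set schema (id, input, expected) are all placeholders for whatever client and labels you actually use:

```python
import json

def run_eval(prompt: str, eval_set: list[dict]) -> dict[str, bool]:
    """Score one prompt against the eval set with a programmatic check (valid JSON + label match)."""
    results = {}
    for ex in eval_set:
        raw = call_model(prompt, ex["input"])          # placeholder for your LLM client
        try:
            correct = json.loads(raw).get("label") == ex["expected"]
        except json.JSONDecodeError:
            correct = False
        results[ex["id"]] = correct
    return results

r1, r2 = run_eval(prompt_v1, eval_set), run_eval(prompt_v2, eval_set)
flips = {k for k in r1 if r1[k] != r2[k]}              # per-example deltas, not just the aggregate
```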

Q18. "How many few-shot examples?"

Show answer

3–8 for most tasks. Pick by coverage (range of valid inputs), include edge cases that confused earlier prompts, put the most representative example last (recent context is weighted more heavily).

Q19. "When to fine-tune over prompt?"

Show answer

Stable task definition, prompting hit a quality ceiling, ≥1k high-quality examples, or always-on prompt overhead eating context. Avoid while requirements are still moving — fine-tuning calcifies behavior.

Q20. "How would you prep a customer's video + sensor stream for analysis?"

Show answer

Frame extraction rate matched to the dynamics. Align clocks via a shared landmark + cross-correlation. Clean sensors (dedupe, sort, winsorize, flag stuck channels). Build a small eval set from the customer's expected outputs. Then run a small slice and iterate on prompts.
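
The clock-alignment step as a sketch, assuming a and b are the two streams around the shared landmark, already resampled to a common rate fs:

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

a0, b0 = a - a.mean(), b - b.mean()          # demean before cross-correlating
lags = correlation_lags(len(a0), len(b0))
lag = lags[np.argmax(correlate(a0, b0))]     # lag with the strongest correlation
offset_seconds = lag / fs                    # shift stream b by this to line it up with stream a
```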

Q21. "Why filtfilt vs causal filter?"

Show answer

filtfilt is zero-phase (offline only, uses future samples). Causal filter introduces phase delay (required for real-time). Pick by deployment — wrong choice either fakes great offline results or introduces lag in production.
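
The contrast in scipy terms, with an illustrative 4th-order low-pass (x is the raw signal, fs its sampling rate; both assumed):

```python
from scipy.signal import butter, filtfilt, lfilter

b, a = butter(4, 5, btype="low", fs=fs)   # cutoff at 5 Hz, illustrative

offline = filtfilt(b, a, x)   # zero-phase: runs forward and backward, needs the full recording
online = lfilter(b, a, x)     # causal: usable sample-by-sample in real time, adds phase delay
```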

Section F · Production & MLOps

Q22. "Sub-100ms fraud scoring. Sketch the architecture."

Show answer

EC2-hosted scoring service. Model loaded once in memory. Precomputed entity features in Redis/DynamoDB (sub-ms). RDS for live app data (~10ms with indexes). Third-party bureau calls async with tight timeout + fallback. Async logging to S3. Profile to confirm budget — model is <5ms; feature lookup dominates.

Q23. "Deploying a new model — walk me through it."

Show answer

CI (lint, types, unit, integration, model-quality regression test). Register as Staging. Smoke tests in staging. Shadow against production for 1–2 weeks. Canary at 1–5%. Monitor outcomes. Promote to full rollout if outcomes hold. Previous Production archived for revert.

Q24. "Calibration is drifting. What do you do?"

Show answer

Confirm the drift is real. Decide between recalibrating (fast, reversible) and retraining (addresses feature-target shift). Ship a recalibrator fit on recent labels as a stopgap. Plan the retrain. Communicate to consumers — calibration changes affect their thresholds.

Q25. "Four layers of monitoring."

Show answer

Service health (latency p95/p99, error rate, throughput). Input drift (PSI per feature). Output drift (score distribution). Outcome metrics (AUC, calibration, lift on rolling labeled window). Senior signal: naming all four.
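
A PSI implementation small enough to whiteboard (expected = reference window, actual = current window, continuous feature assumed):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index of `actual` against the reference `expected` distribution."""
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    edges[0], edges[-1] = -np.inf, np.inf                    # open-ended outer bins
    e = np.histogram(expected, bins=edges)[0] / len(expected)
    a = np.histogram(actual, bins=edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)    # guard against empty bins
    return float(np.sum((a - e) * np.log(a / e)))

# Common rule of thumb: < 0.1 stable, 0.1-0.25 watch, > 0.25 investigate.
```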

Section G · Domain

Q26. "What's synthetic identity fraud?"

Show answer

An identity manufactured from real and fake parts — often a real SSN with fabricated name/DOB. Hard because bureau records can look legitimate (slowly-built credit). Signals are structural: age vs SSN issuance, address sparsity, application velocity, behavioral anomalies.

Q27. "Why does fair lending care about fraud models?"

Show answer

Even though fraud models aren't credit decisions, they affect who gets credit. Disparate decline rates across protected classes are an ECOA/Reg B issue. Staff DS work includes disparate-impact analysis and identifying features that proxy for protected attributes (ZIP code is the classic example).

Q28. "What's hard about multimodal sensor AI vs single-modality?"

Show answer

Synchronization (different rates, clocks, reliabilities), fusion strategy (how to combine modalities meaningfully), and data heterogeneity (every deployment has different sensors and labels, making transfer hard). The strong practitioner reduces these to common scaffolding: standardized preprocessing, modality-agnostic features where possible, prompt-based composition over a foundation model.

Section H · Curveballs

Q29. "Your model's AUC is great but production performance is worse. Why?"

Show answer

Four hypotheses. Temporal leakage (random k-fold instead of time-based). Covariate drift (input distribution shifted). Concept drift (relationship changed). Label drift (label policy changed). Diagnose each in order; the fix is different for each.

Q30. "Customer brings ambiguous data and asks if you can model it. How do you respond?"

Show answer

(1) Frame the decision the model is meant to inform. (2) Quick feasibility scan: label availability, time-to-label, sensor quality, label-noise estimate. (3) Build the smallest baseline that gets to a result — n-shot prompt, simple aggregate, baseline classifier. (4) Score against an eval set the customer agrees on. (5) Decide: iterate, escalate, or politely decline. The senior move is not over-committing before scoping.

Drill protocol

How to drill

Enable drill mode. Read each question. 90-second timer. Speak the answer out loud — out loud, not in your head. Reveal. Compare. Mark "practiced." Aim for 20+ of 30 before any onsite. Reread strong-answer phrasing for the ones you stumbled on.