Production ML
The operational craft a staff DS at the fraud/identity company is expected to bring — production code quality, real-time inference patterns, monitoring, retraining, rollouts, and on-call.
The "end-to-end ownership" bar
From the JD: "This is a full-stack data science role, involving model development, analysis, and writing production code." And: "Write production-ready code that can be relied on for real-time decision making by our partners."
What that means in interviews: you'll be asked about how your code looks, how it's tested, how it deploys, how it's monitored, and what happens when it breaks. Not just "show me an AUC curve."
Production code quality
The dividing line between research code and production code:
Modules, not notebooks
Code lives in .py files, organized into modules. Notebooks are exploration; the deliverable is importable code.
Types
Type hints everywhere. Modern Python: list[int], dict[str, float], Optional[X]. mypy or pyright in CI.
Small functions
One responsibility per function. Pure where possible. Mocking and testing become easy when functions don't have hidden dependencies.
Configuration over hard-coded values
Thresholds, model paths, feature flags — externalize. Read from environment or a config file. Don't edit the model_path = "..." line every deploy.
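A minimal sketch of that pattern, assuming hypothetical environment variables SCORER_MODEL_PATH and SCORER_THRESHOLD:
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class ScorerConfig:
    model_path: str
    threshold: float

    @classmethod
    def from_env(cls) -> "ScorerConfig":
        # Fail fast at startup if required config is missing.
        return cls(
            model_path=os.environ["SCORER_MODEL_PATH"],  # hypothetical var name
            threshold=float(os.environ.get("SCORER_THRESHOLD", "0.5")),
        )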
Error handling
Be explicit about what fails how. Validate inputs at module boundaries; trust within the module. Don't catch broadly when you should let the failure propagate.
Logging
Structured logging (JSON) at appropriate levels. INFO for normal operation, WARNING for unusual but handled, ERROR for failures that need attention. Include correlation IDs so a single request can be traced across services.
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)


@dataclass(frozen=True)
class ScoringResult:
    score: float
    decision: str
    model_version: str
    features_used: dict[str, float]
    correlation_id: str


class Scorer:
    def __init__(self, model, threshold: float, version: str):
        self.model = model
        self.threshold = threshold
        self.version = version

    def score(self, features: dict[str, float], correlation_id: str) -> ScoringResult:
        # Validate at the module boundary; trust internal callers.
        if not features:
            raise ValueError("empty features dict")
        # Assumes the dict's insertion order matches the training column order.
        probs = self.model.predict_proba([list(features.values())])[0]
        score = float(probs[1])
        decision = "flag" if score >= self.threshold else "pass"
        # Structured log with a correlation ID for cross-service tracing.
        logger.info("scored", extra={
            "correlation_id": correlation_id,
            "score": score,
            "decision": decision,
            "model_version": self.version,
        })
        return ScoringResult(score, decision, self.version, features, correlation_id)
Testing
Production ML needs three layers of tests:
Unit tests
Each function tested in isolation. Feature transformations, scoring logic, threshold decisions. Use pytest. Aim for high coverage on critical paths.
Integration tests
The scoring pipeline end-to-end on a known input. "Given this raw application, the service produces this score." Catches issues at module boundaries that unit tests miss.
Model-quality regression tests
Special to ML. Hold out a frozen, labeled test set. On every model change, score the set and require AUC (or lift at K%) to be at least X. Prevents accidental quality regressions from shipping.
"I wouldn't ship a model change without the regression test set passing" is the line that separates a researcher from a production engineer. Mention it unprompted.
Real-time inference
The company's APIs respond within milliseconds. Patterns that matter:
Feature precomputation
For features that can't be computed in the request budget (cross-entity aggregates, graph features), precompute and cache. The request looks up the cached value.
Model serialization
Don't load a pickled model per request. Load once at service startup; keep the model in memory; reuse it across requests.
Batching
When latency permits, batch multiple requests into one model call. Gradient boosting and neural network inference benefit substantially from batching.
Cold-path vs hot-path
Some features need synchronous lookup (call a bureau API). Some don't (lookup in a precomputed feature store). Separate the two paths; budget the latency accordingly.
Fallbacks
If a downstream feature service times out, return a conservative default rather than failing the request. The decision has to happen; missing a feature isn't a reason to deny the application.
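A sketch of a hot path combining these patterns. The load_model and fetch_bureau_features helpers, the feature-store client and its get() method, and the fallback defaults are all assumptions:
import concurrent.futures

FALLBACK_FEATURES = {"bureau_score": 0.0}  # conservative defaults (hypothetical)


class ScoringService:
    def __init__(self, model_path: str, feature_store, threshold: float):
        # Load the model once at startup; keep it in memory for all requests.
        self.scorer = Scorer(load_model(model_path), threshold, version="v12")
        self.feature_store = feature_store
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

    def handle_request(self, entity_id: str, correlation_id: str) -> ScoringResult:
        # Hot path: precomputed features come from a fast keyed lookup.
        features = dict(self.feature_store.get(entity_id))
        # Cold-path dependency: tight timeout, conservative fallback on miss.
        future = self.pool.submit(fetch_bureau_features, entity_id)
        try:
            features.update(future.result(timeout=0.05))  # 50 ms budget
        except concurrent.futures.TimeoutError:
            features.update(FALLBACK_FEATURES)
        return self.scorer.score(features, correlation_id)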
Monitoring
Four layers of monitoring for a deployed model:
1. Service health
Latency p50/p95/p99, error rate, throughput. SRE-style metrics, alertable.
2. Input drift
PSI per feature in a rolling window (a PSI sketch follows this list). Alert when distribution shifts beyond a threshold. Catches upstream data issues fast.
3. Output drift
Score distribution shifts. Mean, median, P95, fraction above the threshold. Alert when these drift past historical bands.
4. Outcome metrics
When labels arrive (chargebacks, manual reviews), score the model on the new labels and confirm AUC, calibration, and lift haven't degraded. This is the slow but ground-truth signal.
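A minimal PSI computation for the input-drift layer, with bin edges taken from the reference window; the 0.1/0.2 bands are a common rule of thumb, not a universal constant:
import numpy as np


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    # Population Stability Index between a reference and a current sample.
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
    ref_pct = np.histogram(reference, edges)[0] / len(reference)
    cur_pct = np.histogram(current, edges)[0] / len(current)
    eps = 1e-6  # avoid log(0) on empty bins
    ref_pct = np.clip(ref_pct, eps, None)
    cur_pct = np.clip(cur_pct, eps, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

# Rule of thumb: PSI < 0.1 stable, 0.1-0.2 watch, > 0.2 alert.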
The dashboard
A senior DS owns the model's monitoring dashboard. It shows all four layers on one page, with thresholds and recent alerts. Stakeholders can glance at it and see model health.
Retraining cadence
How often to retrain depends on the rate of drift:
- Stable problems (legal-text classification, equipment-failure prediction): quarterly retrain is plenty.
- Moderate drift (lending propensity, churn): monthly.
- High drift / adversarial (fraud detection): weekly to bi-weekly.
- Continuous: online learning with frequent micro-updates. Rare in production fraud because validation rigor is harder; usually a batch retrain on a fast cycle is sufficient.
Triggered retraining
Beyond cadence, retrain when monitoring triggers fire: significant input drift, output drift, outcome regression. Build the path for triggered retraining alongside the scheduled one.
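One way to wire the triggers, assuming the drift and outcome metrics are already computed by the monitoring layer; every threshold here is illustrative:
from dataclasses import dataclass


@dataclass
class DriftReport:
    max_feature_psi: float
    score_psi: float
    rolling_auc: float


def should_retrain(report: DriftReport, baseline_auc: float) -> bool:
    # Any single trigger firing is enough to kick off a retrain run.
    return (
        report.max_feature_psi > 0.2                   # input drift
        or report.score_psi > 0.2                      # output drift
        or report.rolling_auc < baseline_auc - 0.02    # outcome regression
    )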
Rollouts & canaries
A new model isn't a deploy you flip in one shot. Stages:
Shadow mode
New model scores every request alongside the current model. Decisions still made by the current model. Compare new scores to old; check for unexpected behavior. Run for days to weeks.
Canary
New model makes decisions on a small slice (1–5%) of traffic, in addition to its shadow scoring on the rest. Monitor outcome metrics on the canary. Promote to full rollout if metrics hold.
A/B rollout
50/50 split. Compare outcome metrics. Promote winner. Useful when you want a clean lift estimate, not just a quality regression check.
Full rollout
New model is the decision-maker. Old model kept in shadow for a while for comparison and easy revert.
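A sketch of shadow scoring plus canary routing in one handler. The old_scorer/new_scorer instances, the comparison logger, and the 1% slice are illustrative; the hash makes assignment deterministic and sticky per entity:
import hashlib

CANARY_FRACTION = 0.01  # hypothetical 1% slice


def in_canary(entity_id: str) -> bool:
    # Same entity always routes the same way across requests.
    bucket = int(hashlib.sha256(entity_id.encode()).hexdigest(), 16) % 10_000
    return bucket < CANARY_FRACTION * 10_000


def handle(entity_id: str, features: dict[str, float], correlation_id: str):
    old_result = old_scorer.score(features, correlation_id)
    new_result = new_scorer.score(features, correlation_id)  # always shadow-scored
    log_shadow_comparison(old_result, new_result)  # hypothetical comparison sink
    # Only the canary slice is actually decided by the new model.
    return new_result if in_canary(entity_id) else old_result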
On-call considerations
A production ML service is on the on-call rotation if it's serving live decisions. The senior DS owns:
- Runbooks: when X alert fires, do Y. Covers common failures (input feature stale, model not loaded, downstream timeout).
- Rollback procedure: how to revert to the previous model version in under a minute.
- Escalation: who do you wake up at 3 AM? When?
- Post-mortems: every incident gets one. Root cause, contributing factors, action items, and (critically) what evaluation gap let the issue ship.
Interview probes
Probe 1: "Walk me through deploying a new fraud model."
Stages: (1) Model-quality regression test on the frozen labeled set must pass. (2) Shadow mode against production traffic for 1–2 weeks; compare score distribution and feature usage against the existing model. (3) Canary on 1–5% of traffic; monitor outcome metrics until labels arrive. (4) A/B or full rollout if outcomes hold. (5) Keep old model in shadow for fast revert. Plus runbooks and on-call handoff at each stage.
Probe 2: "What do you monitor for a deployed model?"
Four layers. (1) Service health — latency p95/p99, error rate, throughput. (2) Input drift — PSI per feature in a rolling window. (3) Output drift — score distribution shifts. (4) Outcome metrics — AUC, calibration, lift on a rolling labeled window, once labels arrive. Senior signal: naming all four; many candidates name only the first three.
Probe 3: "Your model's calibration is drifting. What do you do?"
(1) Confirm drift is real (not a small sample artifact). (2) Decide: recalibrate or retrain? Recalibration on recent labels is faster and reversible; retraining addresses underlying changes in feature-target relationship. (3) Run a temporary recalibrator (isotonic on the latest labeled window) to bridge while preparing a full retrain. (4) Communicate to consumers — calibration changes affect downstream thresholds.
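A minimal recalibration bridge using scikit-learn's IsotonicRegression, fit on the latest labeled window (window selection and serving integration are assumed):
from sklearn.isotonic import IsotonicRegression


def fit_recalibrator(recent_scores, recent_labels) -> IsotonicRegression:
    # Monotone map from raw model scores to calibrated probabilities.
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(recent_scores, recent_labels)
    return iso

# At serving time, apply after the model and before the threshold:
#   calibrated = float(recalibrator.predict([raw_score])[0])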
Probe 4: "How do you test ML code?"
Three layers. Unit tests on transformation, scoring, threshold functions. Integration tests on the end-to-end scoring pipeline with known inputs and expected outputs. Model-quality regression tests — a frozen labeled set with a minimum AUC or lift threshold that must pass before merging a model change. The third is what separates production ML from research ML.
Probe 5: "Sub-100ms latency for fraud scoring. How do you achieve it?"
Precompute expensive features (cross-entity aggregates, graph features) into a feature store keyed for fast lookup. Model loaded once at service startup, kept in memory. Lightweight serialization (e.g., LightGBM's native format, not Pickle). No synchronous calls to slow downstreams in the hot path — if you must call a bureau, do it async with a tight timeout and a fallback. Profile the path; usually the model itself is <5ms; the rest is feature lookup and serialization.