MLOps Applied
The operational glue around production models — CI/CD, experiment tracking, model registry, shadow deploys, canaries, drift monitoring, retraining triggers. The framing that separates "I built a model" from "I shipped one."
Why MLOps belongs to DS here
At companies with separate ML platform teams, much of this is owned by ML engineers. In full-stack DS roles, the DS owns it. JD signals: fraud-domain JDs say "writing production code"; multimodal-AI JDs say "iterative preprocessing cycles." Both imply the DS is responsible for the model continuing to work in production, not just being trained once.
CI/CD for models
A model PR should trigger:
- Code linting + type checks (mypy, ruff). Standard SWE hygiene.
- Unit tests on transformations and scoring functions.
- Integration tests on the end-to-end scoring path.
- Model-quality regression test: train (or load) the new model, score the frozen test set, assert metrics above threshold.
- Feature parity test: features computed from raw events match features used in training.
- Deployment to staging: model serves on a staging environment; smoke tests pass.
- Manual approval gate for production deploys of model versions.
The whole pipeline should run in under 30 minutes — slow CI kills iteration speed.
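To make the model-quality regression test concrete, here is a minimal pytest sketch; the artifact paths, the AUC floor, and the frozen-test-set layout are assumptions for illustration, not from the source:

```python
import pathlib

import joblib
import pandas as pd
from sklearn.metrics import roc_auc_score

ARTIFACTS = pathlib.Path("artifacts")   # hypothetical layout
AUC_FLOOR = 0.85                        # assumed threshold

def test_model_quality_regression():
    # Load the candidate model produced earlier in the pipeline.
    model = joblib.load(ARTIFACTS / "candidate_model.joblib")
    # Score the frozen test set that every model version is judged against.
    frozen = pd.read_parquet(ARTIFACTS / "frozen_test_set.parquet")
    scores = model.predict_proba(frozen.drop(columns=["label"]))[:, 1]
    auc = roc_auc_score(frozen["label"], scores)
    assert auc >= AUC_FLOOR, f"AUC {auc:.4f} below floor {AUC_FLOOR}"
```

Because the test set is frozen, a failing assert means the candidate genuinely regressed, not that the yardstick moved.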
Experiment tracking
For every training run, record:
- Training data hash and row count.
- Feature set version.
- Hyperparameters.
- Code commit hash.
- Evaluation metrics on a fixed holdout.
- Artifacts: serialized model, calibrator, eval-set predictions.
Tools: MLflow, Weights & Biases, Neptune, or a homegrown record on S3. Pick one. The cost of not doing this shows up six months later when nobody can reproduce the model in production.
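A minimal sketch of that run record using MLflow, one of the tools named above; the dataset path, feature-set version, and logged values are placeholders:

```python
import hashlib

import mlflow
import pandas as pd

train = pd.read_parquet("train.parquet")  # placeholder path
# Content hash of the training data, so the exact rows are reproducible.
data_hash = hashlib.sha256(
    pd.util.hash_pandas_object(train, index=True).values.tobytes()
).hexdigest()

with mlflow.start_run():
    mlflow.set_tags({
        "data_hash": data_hash,
        "row_count": len(train),
        "feature_set_version": "fs_v7",  # placeholder
        "git_commit": "abc1234",         # e.g. output of `git rev-parse HEAD`
    })
    mlflow.log_params({"n_estimators": 500, "learning_rate": 0.05})
    # ... train the model, evaluate on the fixed holdout ...
    mlflow.log_metrics({"holdout_auc": 0.91, "holdout_brier": 0.08})
    mlflow.log_artifact("model.joblib")              # serialized model
    mlflow.log_artifact("calibrator.joblib")         # calibrator
    mlflow.log_artifact("eval_predictions.parquet")  # eval-set predictions
```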
Model registry
A central place where named, versioned models live with metadata. "fraud_v2.3, stage: Production" is what the serving service loads. Promotion stages: None → Staging → Production → Archived.
The promotion is a deliberate operation:
- Training writes to None.
- Passing CI promotes to Staging.
- Passing canary promotes to Production.
- Previous Production becomes Archived but stays loadable for fast revert.
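As one concrete instance, MLflow's stage-based registry exposes these promotions directly; the model name and version below are hypothetical:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Training registered version 23 of "fraud_model" (stage: None).
# CI passed, so promote to Staging:
client.transition_model_version_stage(
    name="fraud_model", version="23", stage="Staging"
)

# Canary passed, so promote to Production; the previous Production
# version is archived automatically but stays loadable for revert.
client.transition_model_version_stage(
    name="fraud_model", version="23", stage="Production",
    archive_existing_versions=True,
)

# The serving service loads by stage, not by version:
# model = mlflow.pyfunc.load_model("models:/fraud_model/Production")
```

Newer MLflow releases steer toward aliases rather than stages, but the promotion discipline is the same either way.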
Shadow deploys
The new model serves on every request alongside the existing model. Both compute scores; the existing model's decision is what's used. Log both scores. Useful for:
- Verifying the new model's score distribution matches what you saw in offline evaluation.
- Catching scoring service issues (memory, latency, errors) before users see them.
- Comparing decisions: "of the ones the new model flags, what fraction does the existing model also flag? What's the disagreement profile?"
Typical duration: 1–2 weeks. Cost: roughly 2× inference compute during shadow.
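A hedged sketch of the request path during shadow; the model objects, feature shape, and threshold are assumptions:

```python
import logging

log = logging.getLogger("shadow")

def score_request(features, prod_model, shadow_model, threshold=0.5):
    # Only the existing model's decision takes effect.
    prod_score = prod_model.predict_proba([features])[0, 1]
    decision = prod_score >= threshold
    # The shadow model scores the same request but must never affect it.
    try:
        shadow_score = shadow_model.predict_proba([features])[0, 1]
    except Exception:
        log.exception("shadow scoring failed")
        shadow_score = None
    # Log both scores for offline comparison of distributions and disagreement.
    log.info("prod=%s shadow=%s decision=%s", prod_score, shadow_score, decision)
    return decision
```

In practice the shadow call is often made asynchronously so its latency never reaches the user; the inline version keeps the sketch short.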
Canary rollouts
After shadow, the new model serves real decisions for a small percentage of traffic (1–5%). The rest of traffic stays on the existing model. Compare outcome metrics on the canary versus the control — when labels arrive, you have a small A/B test of model quality.
Decision rule:
- If canary outcomes track or beat control → promote to full rollout.
- If canary outcomes regress materially → roll back, investigate, iterate.
- If canary outcomes are inconclusive (small sample, slow labels) → continue at 1–5% for longer.
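Canary assignment is typically a deterministic hash on a stable entity id, so the same user consistently hits the same model; the sketch below assumes that pattern:

```python
import hashlib

CANARY_PCT = 2  # within the 1-5% range above

def serving_model(entity_id: str) -> str:
    # Hash the id into 100 buckets; the "canary:" salt keeps these buckets
    # independent of any other experiment using the same ids.
    bucket = int(hashlib.md5(f"canary:{entity_id}".encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < CANARY_PCT else "production"
```

Deterministic assignment also gives clean units for the small A/B test once labels arrive.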
Drift & performance monitoring
Detailed in 09-production-ml. A recap of the four layers:
- Service health: latency, error rate, throughput.
- Input drift: PSI per feature.
- Output drift: score-distribution stability.
- Outcome drift: AUC / calibration / lift on a rolling labeled window.
The MLOps angle: each layer needs an owner, an alerting threshold, and a runbook. Without those, the dashboard exists but nobody acts on it.
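For the input-drift layer, a minimal PSI sketch; bin edges come from the training (expected) distribution, and a continuous feature is assumed so the quantile edges are distinct:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    # Bin by quantiles of the expected (training) distribution.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside training range
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Clip to avoid log(0) on empty bins.
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

The common rule of thumb reads PSI under 0.1 as stable, 0.1-0.25 as worth watching, and above 0.25 as significant, which matches the retraining threshold below.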
Retraining triggers
Two trigger patterns, often combined:
Scheduled
Retrain on a cadence (weekly, monthly, quarterly) regardless of drift signals. Predictable; ensures fresh labels are incorporated.
Triggered
Retrain when monitoring fires:
- Significant input drift (PSI > 0.25 on a meaningful feature).
- Output drift past historical bands.
- Outcome metrics regression (AUC drop > 0.02, calibration shift).
- Fresh ground-truth labels surpass a threshold (X% new labels since last retrain).
A drift trigger says "the world has changed." But "retrain on the new data" only helps if the new labels reflect the new world. In fraud, labels often lag — retraining immediately on drift produces a model trained mostly on old labels with new feature distributions, which can be worse than the existing model. The senior judgment is knowing when to wait for fresh labels versus shipping a recalibration as a stopgap.
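An illustrative combination of the two patterns; the PSI and AUC thresholds mirror the text, while the fresh-label floor and cadence are assumed values:

```python
from datetime import datetime, timedelta, timezone

def retrain_action(last_retrain: datetime,      # timezone-aware
                   max_feature_psi: float,
                   auc_drop: float,
                   fresh_label_frac: float,
                   cadence: timedelta = timedelta(days=30)) -> str:
    # Scheduled: retrain on cadence regardless of drift signals.
    scheduled = datetime.now(timezone.utc) - last_retrain >= cadence
    # Triggered: drift fired in inputs or outcomes.
    drifted = max_feature_psi > 0.25 or auc_drop > 0.02
    if scheduled or (drifted and fresh_label_frac >= 0.10):  # assumed 10% floor
        return "retrain"
    if drifted:
        # Drift without fresh labels: recalibrate as a stopgap, wait to retrain.
        return "recalibrate"
    return "wait"
```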
Interview probes
Probe 1: "Walk me through deploying a new model from training to production."
Train → CI runs (lint, types, unit, integration, model-quality regression test). Register in model registry as Staging. Deploy to staging environment, smoke tests pass. Run in shadow against production traffic for 1–2 weeks, comparing score distributions and disagreement profile against the existing model. If clean, canary at 1–5% real-decision traffic. Monitor outcome metrics. If those track or beat control, promote to full rollout. Previous Production archived for fast revert.
Probe 2: "Why shadow before canary?"
Shadow lets me verify the new model's behavior on real production data without affecting decisions. Catches operational issues (latency, errors, memory) and distribution surprises (the offline holdout didn't match production). Canary then validates outcome quality on a small slice — once I know the service works, I want to learn whether decisions are good.
Probe 3: "How do you handle drift alerts at 3 AM?"
Runbook-driven. Page-worthy alerts are reserved for outcome-metric regressions and service health (latency, errors). Input drift is usually a non-paging Slack alert that's investigated next business day — drift over a couple of hours rarely justifies a 3 AM page. Output drift can be paging if it's severe; investigate, decide whether to revert to the previous model, file a follow-up for next business day. The senior move is having the page criteria documented so the on-call DS isn't making the call at 3 AM.
Probe 4: "What goes in a model registry vs an experiment tracker?"
Experiment tracker logs every training run — hyperparameters, metrics, artifacts. Most are never deployed. Model registry holds the named, versioned models that are or might be in production, with promotion stages. Every model in the registry came from a tracked experiment, but most experiments don't make it into the registry. The registry is the source of truth for "what's serving."
Probe 5: "When do you retrain?"
Two patterns, combined. (1) Scheduled cadence — weekly for adversarial fraud problems, monthly for moderate drift, quarterly for stable domains. (2) Triggered by monitoring — significant drift in inputs, outputs, or outcomes; or accumulation of fresh labels above a threshold. The judgment is on the trigger side: don't retrain blindly on drift; verify there are fresh labels reflecting the new world, otherwise you're training on old labels with new features and may make things worse.