Fraud & Imbalanced Data
The realities of fraud modeling: class imbalance at 1% or lower, delayed labels, threshold tuning at scale, and adversaries that evolve. Examples are fraud-domain-flavored throughout.
The shape of fraud data
Modeling fraud is structurally different from modeling churn or conversion. Three properties to internalize:
- Severe class imbalance. 0.1%–2% positive rate is typical. A naive model can score 99% accuracy by predicting "all clean" — useless.
- Delayed and partial labels. Fraud isn't always confirmed at decision time. A chargeback can land 60 days later. A SAR can be filed months after. The label set you train on lags reality.
- Adversarial. Fraudsters adapt to your model. Whatever feature catches them today, they evade tomorrow. Static evaluation understates drift.
Tactics for imbalance
Class weights
Tell the loss function to penalize errors on the minority class more. Cheap and usually first-pass effective.
from lightgbm import LGBMClassifier
# 'balanced' sets weights inversely to class frequency
model = LGBMClassifier(class_weight='balanced')
# or: scale_pos_weight = neg/pos for explicit control
model = LGBMClassifier(scale_pos_weight=99) # if 1% positives
Sample weights
Per-row weights — useful when imbalance isn't just by class but by sub-population (some fraud types are rarer than others and worth more to catch).
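A minimal sketch, assuming X_train, y_train, and a per-row fraud_type array already exist; the type names and weight values are purely illustrative.
import numpy as np
from lightgbm import LGBMClassifier
# Upweight rarer, higher-stakes fraud types (hypothetical values)
weight_by_type = {'clean': 1.0, 'first_party': 20.0, 'synthetic_id': 60.0}
sample_weight = np.array([weight_by_type[t] for t in fraud_type])
model = LGBMClassifier()
model.fit(X_train, y_train, sample_weight=sample_weight)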
Undersample the majority
Drop most of the negatives so positives make up 5–10% of the training set. Training is faster and results are sometimes better. Watch: undersampling inflates predicted probabilities, so you must recalibrate to the true base rate before serving.
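One standard correction, assuming all positives were kept and negatives were kept with probability beta; the formula follows from applying Bayes' rule to the sampling rate.
import numpy as np
def correct_undersampled_probs(p_sampled, beta):
    # Map scores from a model trained on undersampled negatives back to the
    # true base rate. beta = fraction of negatives kept during undersampling.
    p_sampled = np.asarray(p_sampled, dtype=float)
    return beta * p_sampled / (beta * p_sampled + (1.0 - p_sampled))
With beta = 0.01 (keep 1 in 100 negatives), a score of 0.5 from the undersampled model maps back to roughly 0.01, which is the probability you actually want to threshold and serve.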
SMOTE and synthetic oversampling
Generate synthetic minority examples in feature space. Famous, frequently misused. Two cautions:
- SMOTE on categorical / mixed-type features (typical for fraud) produces nonsense synthetic rows. Use SMOTENC for mixed types, but the benefit is smaller.
- SMOTE outside cross-validation folds leaks. Always synthesize inside each fold.
SMOTE is what intermediate ML candidates reach for. Senior practitioners reach for class weights, threshold tuning, and cost-sensitive learning first — they're simpler, equally effective for tabular fraud, and don't have the leakage risks. "SMOTE" as a knee-jerk answer is a small negative signal.
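If you do use SMOTE, the leak-free pattern is to put it inside the cross-validation pipeline so synthetic rows come only from each fold's training split. A minimal sketch, assuming the imbalanced-learn package and an existing X, y.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
# imblearn's Pipeline applies the sampler during fit only, never at predict
# time, so each CV fold synthesizes from its own training split.
pipe = Pipeline([
    ('smote', SMOTE(random_state=0)),
    ('clf', LGBMClassifier()),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='average_precision')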
Anomaly detection as a complement
For very rare fraud types (< 0.1%) or "novel" fraud you have no labels for, semi-supervised anomaly detection (isolation forest, autoencoders) can flag candidates for review even when the supervised model misses them. Run in parallel, combine scores.
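A minimal sketch of the parallel-scoring idea with scikit-learn's IsolationForest, assuming X_train, X_new, and a trained supervised model already exist; rank-averaging is one simple way to combine the scores, not the only one.
from scipy.stats import rankdata
from sklearn.ensemble import IsolationForest
iso = IsolationForest(n_estimators=200, random_state=0).fit(X_train)
anomaly_score = -iso.score_samples(X_new)            # higher = more anomalous
supervised_score = model.predict_proba(X_new)[:, 1]
# Combine on the rank scale so the two scores are on a comparable footing
combined = 0.5 * (rankdata(anomaly_score) + rankdata(supervised_score)) / len(X_new)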
Threshold tuning
The default 0.5 threshold is almost never right for fraud. Pick the threshold based on operational constraints:
Fixed-capacity tuning
"Risk ops can review the top 1000 flagged applications per day." Set the threshold so that at most 1000 applications per day score above it (i.e. the 1000th-highest daily score). Track recall at that threshold.
Fixed-precision tuning
"We need precision ≥ 80% — false positives are expensive." Find the highest threshold that keeps precision above the line, report recall.
Fixed false-positive-rate tuning
"We can tolerate flagging 0.1% of legitimate applications." Threshold = the value that holds FPR at 0.1%.
Cost-optimal tuning
If you can dollarize the cost of FP and FN, pick the threshold minimizing expected cost. The most defensible framing in interviews.
import numpy as np
from sklearn.metrics import precision_score, recall_score

def sweep_thresholds(y_true, y_scores, n=100):
    cuts = np.linspace(0.01, 0.99, n)
    results = []
    for c in cuts:
        y_pred = (y_scores >= c).astype(int)
        results.append({
            'threshold': c,
            'precision': precision_score(y_true, y_pred, zero_division=0),
            'recall': recall_score(y_true, y_pred),
            'flagged_rate': y_pred.mean(),  # fraction of traffic sent to review
        })
    return results
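Picking a threshold from the sweep is then a filter over the results. A sketch of three of the constraints above, with illustrative volumes, costs, and a ~1% base rate; y_val and val_scores are assumed validation labels and scores.
results = sweep_thresholds(y_val, val_scores)

# Fixed capacity: best recall while flagging at most 1,000 of ~200k daily apps
capacity_rate = 1000 / 200_000
by_capacity = max((r for r in results if r['flagged_rate'] <= capacity_rate),
                  key=lambda r: r['recall'])

# Fixed precision: best recall with precision >= 0.80
by_precision = max((r for r in results if r['precision'] >= 0.80),
                   key=lambda r: r['recall'])

# Cost-optimal: dollarize both error types and minimize expected cost per app
COST_FP, COST_FN, BASE_RATE = 50, 2_000, 0.01
def expected_cost(r):
    fn_rate = BASE_RATE * (1 - r['recall'])             # missed fraud
    fp_rate = r['flagged_rate'] * (1 - r['precision'])  # clean apps flagged
    return fn_rate * COST_FN + fp_rate * COST_FP
by_cost = min(results, key=expected_cost)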
Cost-sensitive learning
The honest framing for fraud. Treat each error as having a dollar cost — false positive costs $X (lost legitimate transaction), false negative costs $Y (fraud loss). Optimize expected cost rather than balanced accuracy.
Two implementations
- Per-row sample weights proportional to dollar value at risk. A $50k loan application has 50× the training weight of a $1k application — and 50× the cost of a wrong prediction.
- Bayes-decision threshold. After training a calibrated probabilistic model, flag a case when its expected fraud loss, p × FN-cost, exceeds the expected cost of flagging it, (1 − p) × FP-cost. Solving the break-even gives the cost-minimizing threshold directly: FP-cost / (FP-cost + FN-cost). See the sketch below.
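A minimal sketch of the second implementation; the costs here are illustrative, not a recipe.
def bayes_threshold(cost_fp, cost_fn):
    # Flag when p * cost_fn > (1 - p) * cost_fp; solving the break-even for p
    # gives the cost-minimizing threshold for a calibrated model.
    return cost_fp / (cost_fp + cost_fn)

t_star = bayes_threshold(50, 2_000)   # ~0.024: flag anything above ~2.4% fraud risk
When the FN cost varies per case (e.g. with the transaction amount), the same break-even gives a per-row threshold, cost_fp / (cost_fp + amount), instead of a single global one.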
Label scarcity & delay
Fraud labels aren't all there at training time. Three patterns to handle:
Truncated labels
Recent applications might still become fraudulent (chargeback in 30 days, SAR in 60). Treating them as "not fraud" at training time is wrong. Fix: exclude rows from the most recent N days, or model the time-to-event explicitly.
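The first fix is close to a one-liner; a sketch assuming a pandas frame with an application_date column and a 60-day maturity window.
import pandas as pd
MATURITY_DAYS = 60   # assumed typical time for chargebacks/SARs to arrive
cutoff = df['application_date'].max() - pd.Timedelta(days=MATURITY_DAYS)
train_df = df[df['application_date'] <= cutoff]   # labels here have had time to mature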
Selection bias from manual review
The applications your team manually reviewed and flagged are not a random sample — they're the ones suspicious enough to review. Labels on the un-reviewed population are essentially missing. Methods: inverse-propensity weighting (review-propensity model), bandit-style exploration on a random small fraction of low-risk applications to gather labels.
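A minimal sketch of the inverse-propensity idea, assuming X_all (pre-decision features), a historical reviewed flag, and fraud labels y_fraud on the reviewed rows; the model choice and weight clipping are illustrative.
import numpy as np
from lightgbm import LGBMClassifier
# 1. Model the selection mechanism: how likely was each case to be reviewed?
propensity_model = LGBMClassifier().fit(X_all, reviewed)
p_review = propensity_model.predict_proba(X_all[reviewed])[:, 1]
# 2. Train the fraud model on reviewed rows, weighting each by the inverse of
#    its review propensity so the reviewed sample stands in for the population.
#    Clip to keep a few extreme weights from dominating.
ipw = np.clip(1.0 / p_review, 1.0, 20.0)
fraud_model = LGBMClassifier().fit(X_all[reviewed], y_fraud[reviewed], sample_weight=ipw)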
Noisy labels
"Fraud" is often a judgment call by an investigator. Inter-rater agreement on borderline cases is often 70–80%. Robust losses (focal loss), label smoothing, or explicit uncertainty in the loss function help.
Evaluation that holds in production
Most "great offline AUC, mediocre production" cases are due to evaluation methodology, not modeling. The discipline:
- Time-based splits: train on the past, validate on the next month, test on the month after. Mirrors deployment (see the sketch after this list).
- Group-aware splits: the same entity (customer, merchant, device) never appears in both train and test.
- Realistic label maturity: don't validate on data where labels are still arriving.
- Report at operational threshold: AUC is a summary; the deployed system runs at a single threshold. Report metrics there.
- Stratify reporting: by product, geography, segment. Average lift masks segment-level regressions.
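A minimal sketch of the first two bullets, assuming a frame with application_date and customer_id columns; the dates are illustrative.
train = df[df['application_date'] < '2024-07-01']
valid = df[(df['application_date'] >= '2024-07-01') & (df['application_date'] < '2024-08-01')]
test = df[(df['application_date'] >= '2024-08-01') & (df['application_date'] < '2024-09-01')]
# Group-aware: keep an entity out of validation/test if it already appears in
# train, so repeat customers can't leak memorized behavior across the split.
seen = set(train['customer_id'])
valid = valid[~valid['customer_id'].isin(seen)]
test = test[~test['customer_id'].isin(seen)]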
Adversarial drift
Fraudsters are an adversary. Things to remember:
- A feature that works today may stop working when fraudsters notice. Build features that are hard to evade: derived from joint behavior across multiple signals, not single attributes.
- Retraining cadence matters more for fraud than for static problems. Weekly to monthly is typical.
- Monitor calibration drift, not just AUC drift. Calibration breaks first when fraudsters mimic legitimate patterns.
- Keep a fraction of decisions uncorrelated with the model (random review) to detect novel patterns the model hasn't learned.
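One way to watch calibration rather than AUC, sketched with pandas; week, scores, and labels are assumed arrays aligned per decision, and alerting thresholds are left out.
import pandas as pd
scored = pd.DataFrame({'week': week, 'score': scores, 'fraud': labels})
scored['bucket'] = pd.qcut(scored['score'], 10, duplicates='drop')
# Within each score decile, does observed fraud still match the mean prediction?
calib = (scored.groupby(['week', 'bucket'], observed=True)
               .agg(predicted=('score', 'mean'), actual=('fraud', 'mean')))
calib['gap'] = calib['actual'] - calib['predicted']   # widening gaps = miscalibration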
Interview probes
Show probe 1: "Class imbalance is 99/1. What do you do?"
Don't reach for SMOTE first. Start with: (1) class weights or scale_pos_weight in the model, (2) report PR curve and lift at K%, not accuracy, (3) tune threshold to the operational constraint (capacity, precision, FPR, or cost). SMOTE only if the simpler approaches don't clear the bar, and even then with care about leakage (synthesize inside CV folds) and mixed types (use SMOTENC).
Show probe 2: "How would you pick a threshold for fraud?"
Four options. (1) Fixed-capacity — match risk ops' review bandwidth. (2) Fixed-precision — false positives are expensive, pick the threshold that holds precision above the line. (3) Fixed-FPR — tolerable rate of flagging legitimate apps. (4) Cost-optimal — dollarize FP and FN costs, minimize expected cost. The last is the most defensible but requires a real cost model; the first is the most operational. Default 0.5 is never the right answer.
Show probe 3: "Some of your recent applications might still chargeback. How does that affect training?"
Truncated labels — labeling them as 'not fraud' is wrong because labels are still arriving. Two fixes. (1) Exclude the most recent N days from training, where N is the typical label maturity window. (2) Model time-to-event explicitly with survival analysis, where unresolved cases are censored, not labeled negative. The first is the simpler default; the second is staff bar when label maturity is highly variable.
Show probe 4: "Your model used to catch ring X. It's stopped. Why?"
Four hypotheses. (1) Ring evolved — they noticed a feature that flagged them and changed behavior. Verify by looking at recent caught examples from that ring vs current evaders. (2) Drift in unrelated features made the model less sensitive — check feature drift and importance over time. (3) Label drift — definition of what counts as ring X changed in the labeling process. (4) The ring genuinely retired. Investigate by examining recent escapes by hand.
Show probe 5: "Why keep a random review fraction even when the model is good?"
To detect novel patterns the model hasn't learned. If you only review what the model flags, you only label what looks like known fraud — and you miss new fraud types entirely. A small random sample (1–5%) of low-risk applications goes to manual review purely as exploration. Catches new patterns before they scale, builds labels for retraining.