Feature Engineering
The JD says "inventive feature engineering" drives competitive advantage. Here's what that actually means: entity features, rate/velocity, graph signals, leakage avoidance, drift, and the loop that turns domain insight into model lift.
Why this chapter is the longest in the guide
The JD is unusually direct about where they think the work lives: "success by researching / developing through iteration, integration of new data sources and inventive feature engineering." And: "unusual insights drive our competitive advantage rather than optimization of new machine learning methodologies."
For full-stack applied DS roles, the model is roughly a commodity. The feature set is the moat. This chapter is the longest because it's the differentiating skill — and the most likely deep-dive in a staff loop.
Entity-level features
The most useful features in fraud (and many DS domains) describe an entity — a user, a device, an IP, an SSN, an email — and aggregate that entity's history. Three patterns:
1. Cumulative counts
"How many times has this device been seen?" "How many distinct SSNs has this email been associated with?" Cardinality features are extraordinarily predictive in fraud — a device associated with 50 SSNs is suspicious without further evidence.
2. Temporal aggregates over a window
"How many applications has this SSN been used for in the last 24 hours / 7 days / 30 days?" Time-windowed counts catch velocity attacks.
3. Cross-entity associations
"How many SSNs share a phone number with this one?" Cross-entity counts reveal synthetic-identity rings.
-- Velocity: distinct SSNs seen on this device in the trailing 7 days,
-- counting only applications that arrived strictly before this one
SELECT
    a.application_id,
    a.device_fingerprint,
    COUNT(DISTINCT b.ssn) AS distinct_ssns_7d
FROM applications a
LEFT JOIN applications b
    ON b.device_fingerprint = a.device_fingerprint
    AND b.applied_at >= a.applied_at - INTERVAL '7 days'
    AND b.applied_at < a.applied_at
GROUP BY a.application_id, a.device_fingerprint;
Always window features strictly before the prediction timestamp. A feature like "distinct SSNs ever associated with this device" includes the future relative to historical applications and leaks. Use trailing windows only.
Rate & velocity features
Rate features (events per unit time) and velocity features (rate-of-change) are central to fraud detection. Bursts of activity at a single entity are signal.
- Applications per device per minute / hour / day.
- Address changes per account per month.
- Time-since-last-event per identity.
- Ratio of high-risk transaction types to total.
Choosing windows: ladders of (1m, 5m, 1h, 24h, 7d, 30d) cover most attack patterns. Too granular wastes feature budget; too coarse misses bursts.
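A minimal sketch of the window ladder in pandas, assuming an applications frame with a device fingerprint and a timestamp per row; the frame, column names, and the particular windows are illustrative:

import pandas as pd

# Hypothetical applications table: one row per application event.
apps = pd.DataFrame({
    "device_fingerprint": ["d1", "d1", "d2", "d1"],
    "applied_at": pd.to_datetime([
        "2024-05-01 10:00", "2024-05-01 10:30",
        "2024-05-01 11:00", "2024-05-02 09:00",
    ]),
}).sort_values("applied_at")
indexed = apps.set_index("applied_at")
# One trailing-count feature per rung of the ladder. closed="left" excludes
# the current application, so each count is computed strictly before the
# prediction timestamp (no same-row leakage).
for window in ["1h", "24h", "7D"]:
    feature = (
        indexed.groupby("device_fingerprint")["device_fingerprint"]
               .rolling(window, closed="left", min_periods=0)
               .count()
               .rename(f"apps_per_device_{window}")
    )
    print(feature)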
Graph & network features
Fraud is fundamentally a network problem. Treat entities as nodes, shared attributes as edges, and you can derive features like:
- Degree: how many other entities is this one connected to?
- Component size: how big is the connected component this identity belongs to?
- Distance to known fraud: shortest path from this entity to a labeled fraudulent one.
- Density of fraud in N-hop neighborhood: what fraction of neighbors are flagged?
Synthetic-identity fraud rings share data points (phones, addresses, devices) precisely because they're constructed by the same actor. Graph features expose that structure even when individual entities look clean.
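A sketch of how those signals fall out of a shared-attribute graph, using networkx; the identities, shared attributes, and fraud labels below are invented for illustration:

import networkx as nx

# Identities connect to the attribute nodes they share (phones, addresses, devices).
edges = [
    ("id:alice", "phone:555-0100"),
    ("id:bob",   "phone:555-0100"),   # alice and bob share a phone
    ("id:bob",   "addr:12 Main St"),
    ("id:carol", "addr:12 Main St"),  # bob and carol share an address
    ("id:dave",  "phone:555-0199"),   # dave sits outside the ring
]
G = nx.Graph()
G.add_edges_from(edges)
known_fraud = {"id:alice"}  # labeled fraudulent identities
for node in [n for n in G if n.startswith("id:")]:
    component = nx.node_connected_component(G, node)
    features = {
        "degree": G.degree(node),
        "component_size": len(component),
        # Shortest path to the nearest labeled-fraud identity, if reachable.
        "dist_to_known_fraud": min(
            (nx.shortest_path_length(G, node, f)
             for f in known_fraud if f in component),
            default=None,
        ),
    }
    print(node, features)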
Leakage
The most common source of "great model, terrible production performance." Three kinds:
Temporal leakage
Features whose computation reaches across the prediction timestamp into the future. Example: "total transactions ever for this user". At training time, "ever" includes data recorded after the label was assigned. Fix: define features with explicit "as-of" timestamps that match prediction time.
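One way to enforce that is a point-in-time ("as-of") join: each training row only sees the latest feature snapshot computed at or before its own prediction timestamp. A minimal sketch with pandas.merge_asof, using illustrative table and column names:

import pandas as pd

# Prediction events, each with the time the decision had to be made.
predictions = pd.DataFrame({
    "user_id": [1, 1, 2],
    "predict_at": pd.to_datetime(["2024-03-01", "2024-03-15", "2024-03-10"]),
}).sort_values("predict_at")
# Feature snapshots, each stamped with when the value became known.
feature_snapshots = pd.DataFrame({
    "user_id": [1, 1, 2],
    "snapshot_at": pd.to_datetime(["2024-02-20", "2024-03-10", "2024-03-05"]),
    "txn_count_to_date": [12, 14, 3],
}).sort_values("snapshot_at")
# direction="backward" picks the most recent snapshot at or before predict_at,
# so no row ever sees a value computed after its own prediction time.
train = pd.merge_asof(
    predictions, feature_snapshots,
    left_on="predict_at", right_on="snapshot_at",
    by="user_id", direction="backward",
)
print(train)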
Target leakage
Features that include the label or a near-deterministic proxy for it. Example: in fraud, the "review status" column from a downstream investigator is sometimes joined onto the training table, but that status is set after the prediction is made and is downstream of it. Fix: every feature column needs an explicit "what knowledge was available at prediction time" answer.
Group leakage
Same entity (user, household, device) appears in train and validation, and the model memorizes the entity rather than learning the pattern. Fix: GroupKFold on the entity ID.
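A minimal sketch of the entity-grouped split with scikit-learn's GroupKFold; X, y, and the group ids are placeholders:

import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(10, 3)                            # feature matrix
y = np.random.randint(0, 2, size=10)                 # labels
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])    # entity id per row
gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(gkf.split(X, y, groups=groups)):
    train_groups, val_groups = set(groups[train_idx]), set(groups[val_idx])
    assert train_groups.isdisjoint(val_groups)  # no entity crosses the split
    print(f"fold {fold}: validation entities {sorted(val_groups)}")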
If your model lifts AUC from 0.85 to 0.98 with one new feature, suspect leakage before celebrating. Verify the feature is computable at prediction time, with only data that was available then.
Drift
Features drift. So does the target relationship. Three flavors:
- Covariate drift: feature distribution changes (e.g., new device types, new geographies).
- Concept drift: the relationship between features and target changes (fraud patterns evolve to evade the model).
- Label drift: the base rate or label definition changes (a new fraud type is added to the label, or a tagging policy changes).
Monitoring
- PSI (Population Stability Index) per feature, in production.
- Kolmogorov-Smirnov on prediction distribution.
- Calibration check when labels arrive — if 0.3 used to mean 30% positive rate and now means 22%, the model is drifting.
- Performance metrics (AUC, lift at K%) on a rolling window of labeled data.
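A small sketch of a per-feature PSI check, assuming you hold a training-time sample and a recent production sample of the same feature or score; the binning and thresholds below are conventional rules of thumb, not anything specific to this guide:

import numpy as np

def psi(expected, actual, bins=10):
    # Population Stability Index between a training-time sample (expected)
    # and a production sample (actual). Bin edges come from the expected
    # distribution; clipping avoids log(0) on empty bins.
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

# Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 investigate.
train_scores = np.random.beta(2, 8, 10_000)   # placeholder distributions
prod_scores = np.random.beta(2, 6, 10_000)
print(f"PSI = {psi(train_scores, prod_scores):.3f}")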
Feature stores
A feature store keeps the training-time and serving-time feature computation logic identical, with features pre-computed and cached for low-latency serving. Examples: Feast, Tecton, Vertex AI Feature Store, in-house variants.
You don't need one for a small team. You need one when:
- Multiple models share features and you can't afford definitional drift between them.
- Latency budget requires precomputed features (most real-time fraud / lending applications).
- You need point-in-time-correct training data, and serving-time features have to match it exactly.
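If you do adopt one, the definition layer is mostly declarative. The sketch below shows roughly what a Feast feature view could look like for the device-velocity feature above; treat it as illustrative only, since Feast's API has shifted across versions and the entity, source path, and field names are assumptions, not anything from this chapter:

from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource
from feast.types import Int64

# Illustrative only: check the Feast docs for your release.
device = Entity(name="device", join_keys=["device_fingerprint"])
source = FileSource(
    path="data/device_velocity.parquet",   # assumed offline store path
    timestamp_field="event_timestamp",
)
device_velocity = FeatureView(
    name="device_velocity",
    entities=[device],
    ttl=timedelta(days=7),
    schema=[Field(name="distinct_ssns_7d", dtype=Int64)],
    source=source,
)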
Domain insight: the actual differentiator
Inventive feature engineering isn't "tried 50 polynomial features and grid-searched." It's noticing a structural pattern in fraud behavior that a generic feature library wouldn't surface, and encoding it.
Where domain features come from
- Reading post-mortems on caught fraud rings — what was the giveaway? Encode that.
- Shadowing investigators on the fraud or risk-ops team — what do they actually look at? That signal probably isn't in your features yet.
- Reading the methodology of competitors publicly — fraud is an arms race and patterns generalize.
- Doing error analysis on your model's wrong predictions: what signal does the analyst use to override the model that isn't in your feature set yet?
Example pattern
A senior DS at a lender notices that synthetic identities tend to apply with addresses that geocode to a residential unit but whose Google Maps Street View shows a commercial building. That observation becomes a feature: "address-type-mismatch flag." That feature wouldn't exist in any auto-generated feature library.
For staff DS, lead with a feature you discovered through domain investigation, not a model architecture choice. "I noticed that fraud applications used credit reports pulled the same day as the application 90% of the time, vs 30% in legitimate apps — I built an 'application-to-credit-pull time gap' feature that moved AUC from 0.86 to 0.89. The insight wasn't from a search; it came from shadowing the investigations team for a week." That's the fraud-domain-flavored answer.
Interview probes
Probe 1: "Give me an example of a high-leverage feature you've built."
Have one ready. Frame: the domain observation, the feature, the lift it produced, the leakage check you did. The lift number matters less than naming the observation that led to it — that's what the fraud/identity company is testing for.
Probe 2: "How do you check for leakage?"
For every feature, ask "what data was available at prediction time?" and trace the feature's logic. Specifically: temporal leakage (features that reach forward in time), target leakage (features that depend on the label or are downstream of it), group leakage (same entity in train and validation). Empirical signal: implausibly high single-feature lift (an AUC jump of 10+ points from one new feature is almost always leakage).
Probe 3: "Your model's AUC is great on validation but production performance is worse. What do you investigate?"
Four hypotheses. (1) Temporal leakage in validation — random k-fold instead of time-based. (2) Covariate drift — production data has a different feature distribution than training. (3) Concept drift — the relationship has changed; fraudsters adapted. (4) Label drift — what counts as 'fraud' in production labels is different from training labels. Diagnose with: time-based holdout reproducing the gap, PSI per feature, calibration check on recent labels, label-policy audit.
Probe 4: "How would you design features for synthetic identity detection?"
Layered. (1) Entity cardinality: how many distinct names/emails/phones have been associated with this SSN historically? (2) Velocity: applications per SSN/device in trailing 1d/7d/30d. (3) Graph proximity: distance from this identity to known fraud in a shared-attribute graph. (4) Consistency: does the SSN's issuance date line up with the applicant's date of birth (an SSN issued before the applicant was born is a classic flag)? Does the address geocode to a residential unit? (5) Behavioral: typing/click patterns on the application form, if logged.
Probe 5: "When would you use a feature store?"
When training-time and serving-time feature computation has to be identical (point-in-time correctness), or when multiple models share features and definitional drift would cause confusion. For a single model with batch serving, a feature store is overhead — a well-structured dbt pipeline plus a serving query is enough. Reach for Feast / Tecton when latency, sharing, or freshness force the issue.