Section C · Coding

Coding Fundamentals

The Python patterns full-stack DS interviewers actually reach for — production-flavored code, testability, vectorization, NumPy / pandas fluency, and the Big-O intuition that wins live-coding rounds.

What to expect

For staff-level full-stack DS, coding rounds skew toward "production-flavored" problems: write a function that another engineer could review, handle inputs cleanly, and be ready to discuss extensions. Staff-DS JDs name "production code and tests" explicitly.

You should expect:

  • 30–60 minute live coding sessions.
  • Problems framed in domain language (fraud, scoring, features) rather than pure algorithms.
  • Follow-ups that add wrinkles — latency budget, streaming variant, scale requirement.
  • An expectation that your code includes types, small functions, and at least one example test.

Classes & testability

When the function has state (a model, a threshold, a feature pipeline), wrap it in a class with explicit dependencies. That makes mocking and testing easy.

testable scoring class
from typing import Protocol

class Model(Protocol):
    def predict_proba(self, X: list[list[float]]) -> list[list[float]]: ...

class Scorer:
    def __init__(self, model: Model, threshold: float):
        self.model = model
        self.threshold = threshold

    def score(self, features: list[float]) -> dict:
        prob = self.model.predict_proba([features])[0][1]
        return {"prob": prob, "decision": "flag" if prob >= self.threshold else "pass"}

# Test with a fake model — no need to load a real one
class FakeModel:
    def predict_proba(self, X):
        return [[0.4, 0.6] for _ in X]

def test_scorer_flags_above_threshold():
    s = Scorer(FakeModel(), threshold=0.5)
    assert s.score([1.0, 2.0])["decision"] == "flag"

The senior signal: writing the test alongside the class without being asked.

Generators & streaming

For data that doesn't fit in memory or arrives over time, use generators. They yield one item at a time and never load the whole stream.

streaming aggregation
import json
from collections import defaultdict

def read_events(path):
    """Stream events from a large JSONL file one at a time."""
    with open(path) as f:
        for line in f:
            yield json.loads(line)   # parse lazily; never hold the whole file

def aggregate_by_user(events):
    counts = defaultdict(int)
    for e in events:
        counts[e["user_id"]] += 1
    return counts

counts = aggregate_by_user(read_events("huge_log.jsonl"))

The interview tell: when the problem says "a large file," reach for generators. When the problem says "streaming," reach for generators or async iteration.
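
A minimal sketch of the async variant; fake_source here is a stand-in for a real async consumer (Kafka, websocket), not part of the original example:

async streaming variant
import asyncio

async def fake_source():
    """Stand-in for a real async event source."""
    for uid in ["a", "b", "a"]:
        await asyncio.sleep(0)        # simulate arrival over time
        yield {"user_id": uid}

async def aggregate_by_user(events):
    counts = {}
    async for e in events:            # consume each event as it arrives
        counts[e["user_id"]] = counts.get(e["user_id"], 0) + 1
    return counts

print(asyncio.run(aggregate_by_user(fake_source())))   # {'a': 2, 'b': 1}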

Vectorization vs loops

Loops over arrays / series are 10–100× slower than vectorized operations. For any data of meaningful size, vectorize.

slow vs vectorized
import numpy as np

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)

# Slow — Python loop
result_slow = [a * b for a, b in zip(x, y)]

# Fast — vectorized
result_fast = x * y

# Conditional logic: np.where, not Python if/else
flags = np.where(x > 0.95, "high", np.where(x > 0.5, "mid", "low"))

NumPy patterns

Broadcasting

Operations between arrays of different shapes follow fixed rules: shapes are aligned from the trailing dimensions, and dimensions of size 1 (or missing) are stretched to match. Master this and a lot of code shrinks.

broadcasting
import numpy as np

# Per-feature standardization without loops
X = np.random.randn(1000, 20)        # 1000 rows, 20 features
mean = X.mean(axis=0)                # shape (20,)
std = X.std(axis=0)                  # shape (20,)
X_normalized = (X - mean) / std      # broadcast (1000, 20) - (20,)

Indexing tricks

  • Boolean masking: X[mask] selects rows where mask is True.
  • Fancy indexing: X[indices] selects rows in arbitrary order.
  • argsort / argpartition: get the indices that would sort the array, without sorting the array itself; all three patterns are sketched below.
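
A minimal sketch, with illustrative array names:

indexing tricks
import numpy as np

scores = np.random.rand(100)

# Boolean masking: keep only elements where the condition holds
high = scores[scores > 0.9]

# Fancy indexing: arbitrary order, repeats allowed
picked = scores[[3, 1, 4, 1]]

# Top-5 without a full sort: argpartition is O(n)
top5_idx = np.argpartition(scores, -5)[-5:]                # unordered top-5 indices
top5_idx = top5_idx[np.argsort(scores[top5_idx])[::-1]]    # order just those 5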

Pandas patterns

Group-by-aggregate

groupby patterns
import pandas as pd

# df assumed to have columns: user_id, event_id, amount, event_at
# Multiple aggregations per group (named aggregation)
df.groupby('user_id').agg(
    n_events=('event_id', 'count'),
    total_spend=('amount', 'sum'),
    first_event=('event_at', 'min'),
)

# Window functions
df['cumulative_spend'] = df.groupby('user_id')['amount'].cumsum()
df['rank_within_user'] = df.groupby('user_id')['event_at'].rank(method='dense')
df['days_since_prev'] = df.groupby('user_id')['event_at'].diff().dt.days

Merge / join

pd.merge defaults to an inner join (DataFrame.join defaults to left), so always specify how=. And check row counts before and after a merge; silent row duplication from an unexpected many-to-many key is a common bug.
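
A hedged sketch of that habit; the frames and column names here are illustrative:

guarded merge
import pandas as pd

transactions = pd.DataFrame({"user_id": ["a", "a", "b"], "amount": [5.0, 7.0, 2.0]})
users = pd.DataFrame({"user_id": ["a", "b"], "segment": ["new", "tenured"]})

n_before = len(transactions)
merged = transactions.merge(
    users,
    on="user_id",
    how="left",               # spell out the join type, even when it matches the default
    validate="many_to_one",   # raises MergeError if users has duplicate user_id rows
)
assert len(merged) == n_before, "join key duplication changed the row count"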

The asof merge

For "what was the value of X at the time of event Y" — standard time-series alignment problem. Use pd.merge_asof with direction='backward' and a tolerance.

Pandas at staff bar

Use pandas for analysis and prototyping. For production paths that need to scale past a few GB or run in real-time, either push to SQL or use a more performant alternative (Polars, DuckDB). Knowing when to leave pandas is part of the staff signal.
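
One hedged illustration of "leaving pandas": the earlier groupby pushed through DuckDB's SQL engine, which can scan the same in-memory df by variable name (assumes the duckdb package and the df from the groupby example):

the same groupby in DuckDB
import duckdb

duckdb.sql("""
    SELECT user_id,
           count(*)    AS n_events,
           sum(amount) AS total_spend
    FROM df
    GROUP BY user_id
""").df()   # materialize the result back as a pandas DataFrame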

Testing & reproducibility

  • Pytest for unit and integration tests.
  • Fixtures for shared setup. Don't duplicate the model-loading boilerplate across tests.
  • Parameterize tests over cases (pytest spells it @pytest.mark.parametrize): one test, many cases, clean output on failure; see the sketch after this list.
  • Set random seeds wherever a model fit, sample, or train/test split happens. Reproducibility for free.
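
A sketch combining these habits, reusing Scorer and FakeModel from the testability section; the threshold values are illustrative:

pytest patterns
import numpy as np
import pytest

# Scorer and FakeModel come from the "Classes & testability" block above

@pytest.fixture
def fake_model():
    """Shared setup: build the fake once, no model-loading boilerplate."""
    return FakeModel()

@pytest.fixture(autouse=True)
def seed():
    np.random.seed(0)   # every test sees the same randomness

@pytest.mark.parametrize("threshold, expected", [(0.5, "flag"), (0.7, "pass")])
def test_threshold_boundary(fake_model, threshold, expected):
    s = Scorer(fake_model, threshold=threshold)   # FakeModel always returns prob 0.6
    assert s.score([1.0, 2.0])["decision"] == expected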

Big-O cheat sheet

Operation                          Complexity
Dict / set lookup, insert          O(1) amortized
Sort                               O(n log n)
Heap push/pop                      O(log n)
Top-K via heap                     O(n log K)
NumPy vectorized op                O(n), constant ~100× smaller than a Python loop
Pandas groupby aggregate           O(n) for built-ins, much slower for .apply with a Python function
Boolean mask on array of size n    O(n)
String concat in loop              O(n²) in Python; use list + join
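
The top-K row is the one interviewers probe most; a minimal sketch with heapq (data is illustrative):

top-K via heap
import heapq

risk_scores = [0.31, 0.95, 0.12, 0.88, 0.67]

# O(n log K): keeps a heap of size K instead of sorting all n items
top3 = heapq.nlargest(3, risk_scores)   # [0.95, 0.88, 0.67]

# Same idea keyed on a field
events = [{"user_id": "a", "risk": 0.9}, {"user_id": "b", "risk": 0.4}]
riskiest = heapq.nlargest(1, events, key=lambda e: e["risk"])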

Live-coding tips for full-stack DS rounds

  1. Clarify the input: types, shape, edge cases. Fraud-domain-flavored problems often have NULL/missing data buried in the spec.
  2. State approach + complexity before coding.
  3. Write tests first or alongside. Especially for transformation functions.
  4. Use types: def f(x: list[int]) -> dict[str, float]: — interviewers notice.
  5. Talk through tradeoffs out loud: "I'm using a dict for O(1) lookup; could also sort and binary-search."
  6. Handle edge cases explicitly: empty input, single element, ties (see the closing sketch below).
  7. Anticipate the production follow-up: "What would change at 1B rows?" Have an answer.
  8. When stuck, narrate — silence is the enemy.
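
To make tips 3, 4, and 6 concrete, one small, hypothetical example (the function and its domain framing are illustrative):

tips in practice
def flag_rate(decisions: list[str]) -> float:
    """Share of 'flag' decisions among all decisions."""
    if not decisions:       # edge case handled explicitly: empty input
        return 0.0
    return sum(d == "flag" for d in decisions) / len(decisions)

def test_flag_rate():
    assert flag_rate([]) == 0.0                   # edge case first
    assert flag_rate(["flag", "pass"]) == 0.5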