Coding Fundamentals
The Python patterns DS interviewers actually reach for — dict aggregation, sorting tricks, sliding window, prefix sums, pandas vs SQL, vectorization, and the Big-O intuition that wins live-coding rounds.
What to expect
DS coding rounds at product-analytics-leaning loops are lighter than software-eng loops, but they're not trivial. Expect 30–45 minutes, one problem, often with a follow-up. The problem will be Python and will usually involve:
- A list or dict of records (transactions, events, sessions).
- An aggregation, ranking, or grouping question.
- One follow-up that adds a wrinkle (a time window, a tie-breaking rule, a streaming variant).
You don't usually need fancy algorithms (no Dijkstra, no DP). You need to be fluent with the bread-and-butter patterns and able to communicate while writing. Talking through tradeoffs as you code scores more points than silently producing the optimal solution.
Dict-based aggregation
The single most common pattern. "Given a list of events, return X grouped by user/category/whatever."
```python
from collections import defaultdict, Counter

events = [
    {"user": "a", "type": "click", "amount": 1},
    {"user": "a", "type": "buy", "amount": 50},
    {"user": "b", "type": "click", "amount": 1},
]

# Count by user
by_user = Counter(e["user"] for e in events)
# {'a': 2, 'b': 1}

# Sum by user
spend = defaultdict(int)
for e in events:
    spend[e["user"]] += e["amount"]
# {'a': 51, 'b': 1}

# Nested: count by user × type
matrix = defaultdict(lambda: defaultdict(int))
for e in events:
    matrix[e["user"]][e["type"]] += 1
# {'a': {'click': 1, 'buy': 1}, 'b': {'click': 1}}

# Group records into lists by key
groups = defaultdict(list)
for e in events:
    groups[e["user"]].append(e)
```
The interview tell: reach for Counter when counting, defaultdict(int) when summing, defaultdict(list) when grouping records. Reach for itertools.groupby only when the input is already sorted.
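The groupby caveat is worth seeing once: on unsorted input, itertools.groupby yields one group per consecutive run, not one per key. A minimal sketch:

```python
from itertools import groupby

data = ["a", "b", "a", "a", "b"]

# Unsorted: groupby splits on every key change, so "a" shows up twice.
runs = [(k, len(list(g))) for k, g in groupby(data)]
# [('a', 1), ('b', 1), ('a', 2), ('b', 1)]

# Sort first, and you get one group per key, as intended.
grouped = [(k, len(list(g))) for k, g in groupby(sorted(data))]
# [('a', 3), ('b', 2)]
```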
Sorting tricks
Two patterns to know cold:
Sort by a derived key
```python
# Sort users by total spend desc, then alphabetically asc as tiebreak
ranked = sorted(spend.items(), key=lambda x: (-x[1], x[0]))

# Top 3 most active users
import heapq
top3 = heapq.nlargest(3, spend.items(), key=lambda x: x[1])
```
Use heapq.nlargest / nsmallest when k << n — it's O(n log k) instead of O(n log n).
Stable sort with composite keys
Python's sort is stable: elements that compare equal keep their relative order. Sort by the less important key first, then by the more important key, and the result is ordered by the priority you want. (Or build a tuple key like above — usually cleaner.)
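A minimal sketch of the two-pass trick on a hypothetical spend dict: the first pass orders names alphabetically, and the second, stable pass by spend preserves that order within ties.

```python
spend = {"carol": 51, "alice": 51, "bob": 7}

# Pass 1: secondary key — name, ascending.
items = sorted(spend.items())

# Pass 2: primary key — spend, descending. Stability keeps
# alice before carol among the equal spends.
items.sort(key=lambda x: x[1], reverse=True)
# [('alice', 51), ('carol', 51), ('bob', 7)]
```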
Sliding window & two pointers
For "find a contiguous range with property X" or "find max sum / shortest subarray with constraint."
```python
def longest_subarray_sum_at_most(arr: list[int], k: int) -> int:
    left = 0
    running = 0
    best = 0
    for right, val in enumerate(arr):
        running += val
        while running > k and left <= right:
            running -= arr[left]
            left += 1
        best = max(best, right - left + 1)
    return best
```
Two-pointer is the same shape with different invariants. The trick is identifying what to maintain in the window — a sum, a count, a set of seen elements.
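To illustrate a different invariant, here is the classic longest-substring-without-repeats, where the window maintains a set of seen characters instead of a sum. Same shape as the function above:

```python
def longest_unique_substring(s: str) -> int:
    seen = set()   # invariant: characters currently inside the window
    left = 0
    best = 0
    for right, ch in enumerate(s):
        # Shrink from the left until ch is no longer a duplicate.
        while ch in seen:
            seen.remove(s[left])
            left += 1
        seen.add(ch)
        best = max(best, right - left + 1)
    return best

# longest_unique_substring("abcabcbb") -> 3  ("abc")
```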
Prefix sums & running stats
For "many range queries over a fixed array" — O(n) preprocessing, O(1) per query. Senior-DS variant: rolling counts, rolling sums for time-series.
```python
def make_prefix(arr):
    prefix = [0] * (len(arr) + 1)
    for i, v in enumerate(arr):
        prefix[i + 1] = prefix[i] + v
    return prefix

prefix = make_prefix([3, 1, 4, 1, 5, 9, 2, 6])

def range_sum(l, r):  # inclusive
    return prefix[r + 1] - prefix[l]
```
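The rolling-sum variant follows directly: a fixed window of size w is just the difference of two prefix entries, so each position costs O(1). A sketch:

```python
def rolling_sums(arr: list[int], w: int) -> list[int]:
    # prefix[i] = sum of arr[:i], same construction as make_prefix
    prefix = [0] * (len(arr) + 1)
    for i, v in enumerate(arr):
        prefix[i + 1] = prefix[i] + v
    # One O(1) subtraction per window position.
    return [prefix[i + w] - prefix[i] for i in range(len(arr) - w + 1)]

# rolling_sums([3, 1, 4, 1, 5], 3) -> [8, 6, 10]
```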
The famous DS interview problem "subarray sum equals k" is a prefix-sum + hash-map trick:
```python
def subarray_sum_eq_k(nums: list[int], k: int) -> int:
    seen = {0: 1}  # prefix sum → count
    running = 0
    answer = 0
    for n in nums:
        running += n
        answer += seen.get(running - k, 0)
        seen[running] = seen.get(running, 0) + 1
    return answer
```
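A good interview habit is to sanity-check the clever version against a brute force on small inputs. A hedged O(n²) reference that should agree with the hash-map version:

```python
def subarray_sum_eq_k_brute(nums: list[int], k: int) -> int:
    # O(n^2) reference: try every (start, end) pair.
    n = len(nums)
    count = 0
    for i in range(n):
        running = 0
        for j in range(i, n):
            running += nums[j]
            if running == k:
                count += 1
    return count

# e.g. [1, 1, 1] with k=2 -> 2, matching the O(n) version
```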
Pandas vs SQL
Common live-coding question: "do this in pandas." The trick is recognizing that most DS-flavored questions are SQL with extra steps. Map each operation:
| SQL | Pandas |
|---|---|
| SELECT col FROM t | df['col'] |
| WHERE col > 5 | df[df['col'] > 5] |
| GROUP BY a, b | df.groupby(['a', 'b']) |
| ORDER BY a DESC | df.sort_values('a', ascending=False) |
| JOIN ... ON ... | df1.merge(df2, on='id', how='left') |
| ROW_NUMBER() OVER (PARTITION BY ...) | df.groupby('a').cumcount() + 1 |
| LAG / LEAD | df.groupby('a')['v'].shift(1) |
| SUM() OVER (...) | df.groupby('a')['v'].cumsum() |
df.groupby('a').apply(fn) is convenient but slow on large data. Prefer the built-in aggregations (.sum(), .agg({'col': 'mean'})) when possible. If you need .apply, say "this would not scale on 100M rows — at scale I'd push it to SQL or vectorize."
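To make the mapping concrete, here is a sketch translating one SQL query, ROW_NUMBER() OVER (PARTITION BY user ORDER BY spend DESC) filtered to rn = 1, into pandas on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "user":  ["a", "a", "b", "b", "b"],
    "spend": [50, 10, 30, 70, 20],
})

# ORDER BY inside the window: sort by user, then spend descending.
out = df.sort_values(["user", "spend"], ascending=[True, False]).copy()

# ROW_NUMBER() OVER (PARTITION BY user): cumcount within each group.
out["rn"] = out.groupby("user").cumcount() + 1

# WHERE rn = 1: each user's top spend.
top = out[out["rn"] == 1]
# top spends: a -> 50, b -> 70
```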
Vectorization
Loops over numpy arrays or pandas series are slow. Vectorized operations are 10–100× faster. Two patterns:
Replace a loop with elementwise ops
```python
import numpy as np

prices = np.array([100, 200, 150, 175])
discounts = np.array([0.1, 0.15, 0.2, 0.05])

# Slow
sale = [p * (1 - d) for p, d in zip(prices, discounts)]

# Vectorized — 10–100x faster
sale = prices * (1 - discounts)
```
Use np.where instead of if/else
```python
df['tier'] = np.where(df['spend'] > 100, 'high',
             np.where(df['spend'] > 10, 'mid', 'low'))
```
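Nested np.where gets unwieldy past two or three tiers; np.select takes parallel lists of conditions and choices and is easier to read. A sketch on a hypothetical spend array:

```python
import numpy as np

spend = np.array([250, 50, 3, 120])

# Conditions are checked in order; first match wins.
conditions = [spend > 100, spend > 10]
choices = ["high", "mid"]
tier = np.select(conditions, choices, default="low")
# ['high', 'mid', 'low', 'high']
```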
Big-O cheat sheet
| Operation | Complexity |
|---|---|
| Dict / set lookup, insert | O(1) amortized |
| Sort | O(n log n) |
| Linear scan | O(n) |
| Two nested loops | O(n²) |
| Heap push/pop | O(log n) |
| Top-k via heap | O(n log k) |
| in on a list | O(n) |
| in on a set / dict | O(1) |
| String concat in a loop | O(n²) in Python — use list + join |
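The last row deserves one line of code: each += on a string copies everything accumulated so far, so collect pieces in a list and join once at the end.

```python
def render_csv_row(values: list[int]) -> str:
    # Anti-pattern: s += str(v) + "," in a loop is O(n^2),
    # because each += copies the whole string built so far.
    # Linear version: build pieces, join once.
    return ",".join(str(v) for v in values)

# render_csv_row([1, 2, 3]) -> "1,2,3"
```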
Live-coding tips
- Clarify before coding. Restate the problem. Ask: input format? Edge cases (empty input, ties, NULLs)? Expected output format? Allowed time/space?
- State your approach before writing code. "I'm going to use a dict to count, then sort by count. O(n log n)." Get a nod first.
- Write tests / examples next. Even 2–3 examples manually traced. Catches off-by-ones early.
- Code, narrating. Talk through each line. Silence in a live-coding round is bad.
- Trace through your own example after coding. Don't ask the interviewer if it's right — verify yourself.
- Handle edge cases explicitly: empty input, single-element input, ties.
- Discuss tradeoffs: "I could use a hash map for O(n) time at O(n) space, or sort for O(n log n) at O(1) space." Pick one and defend.
- When stuck, narrate the stuck: "I'm thinking about how to handle the tie-breaking case…" The interviewer will often nudge.
Senior DS candidates ask one more clarifying question than mid-level. They say "if this were a one-off analysis I'd do X, but if this needed to run daily at scale I'd do Y" — without being asked. They tell the interviewer when their solution is good enough vs when they'd want to iterate further.