Coding Fundamentals
The Python patterns DS interviewers actually reach for — dict aggregation, sorting tricks, sliding window, prefix sums, pandas vs SQL, vectorization, and the Big-O intuition that wins live-coding rounds.
What to expect
DS coding rounds at product-analytics-leaning loops are lighter than software-eng loops, but they're not trivial. Expect 30–45 minutes, one problem, often with a follow-up. The problem will be Python and will usually involve:
- A list or dict of records (transactions, events, sessions).
- An aggregation, ranking, or grouping question.
- One follow-up that adds a wrinkle (a time window, a tie-breaking rule, a streaming variant).
You don't usually need fancy algorithms (no Dijkstra, no DP). You need to be fluent with the bread-and-butter patterns and able to communicate while writing. Talking through tradeoffs as you code scores more points than silently producing the optimal solution.
Dict-based aggregation
The single most common pattern. "Given a list of events, return X grouped by user/category/whatever."
```python
from collections import defaultdict, Counter

events = [
    {"user": "a", "type": "click", "amount": 1},
    {"user": "a", "type": "buy", "amount": 50},
    {"user": "b", "type": "click", "amount": 1},
]

# Count by user
by_user = Counter(e["user"] for e in events)
# {'a': 2, 'b': 1}

# Sum by user
spend = defaultdict(int)
for e in events:
    spend[e["user"]] += e["amount"]
# {'a': 51, 'b': 1}

# Nested: count by user × type
matrix = defaultdict(lambda: defaultdict(int))
for e in events:
    matrix[e["user"]][e["type"]] += 1
# {'a': {'click': 1, 'buy': 1}, 'b': {'click': 1}}

# Group records into lists by key
groups = defaultdict(list)
for e in events:
    groups[e["user"]].append(e)
```
The interview tell: reach for Counter when counting, defaultdict(int) when summing, defaultdict(list) when grouping records. Reach for itertools.groupby only when the input is already sorted.
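The groupby caveat is worth seeing once: on unsorted input, itertools.groupby yields one group per consecutive run, not one per key. A minimal sketch:

```python
from itertools import groupby

data = ["a", "b", "a", "a", "b"]

# Unsorted: groupby splits on every key change, so "a" shows up twice.
runs = [(k, len(list(g))) for k, g in groupby(data)]
# [('a', 1), ('b', 1), ('a', 2), ('b', 1)]

# Sort first, and you get one group per key, as intended.
grouped = [(k, len(list(g))) for k, g in groupby(sorted(data))]
# [('a', 3), ('b', 2)]
```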
Sorting tricks
Two patterns to know cold:
Sort by a derived key
```python
# Sort users by total spend desc, then alphabetically asc as tiebreak
ranked = sorted(spend.items(), key=lambda x: (-x[1], x[0]))

# Top 3 most active users
import heapq
top3 = heapq.nlargest(3, spend.items(), key=lambda x: x[1])
```
Use heapq.nlargest / nsmallest when k << n — it's O(n log k) instead of O(n log n).
Stable sort with composite keys
Python's sort is stable: elements that compare equal keep their relative order. Sort by the less important key first, then by the more important key, and the result is ordered by the priority you want. (Or build a tuple key like above — usually cleaner.)
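A minimal sketch of the two-pass trick on a hypothetical spend dict: the first pass orders names alphabetically, and the second, stable pass by spend preserves that order within ties.

```python
spend = {"carol": 51, "alice": 51, "bob": 7}

# Pass 1: secondary key — name, ascending.
items = sorted(spend.items())

# Pass 2: primary key — spend, descending. Stability keeps
# alice before carol among the equal spends.
items.sort(key=lambda x: x[1], reverse=True)
# [('alice', 51), ('carol', 51), ('bob', 7)]
```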
Sliding window & two pointers
For "find a contiguous range with property X" or "find max sum / shortest subarray with constraint."
```python
def longest_subarray_sum_at_most(arr: list[int], k: int) -> int:
    left = 0
    running = 0
    best = 0
    for right, val in enumerate(arr):
        running += val
        while running > k and left <= right:
            running -= arr[left]
            left += 1
        best = max(best, right - left + 1)
    return best
```
Two-pointer is the same shape with different invariants. The trick is identifying what to maintain in the window — a sum, a count, a set of seen elements.
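To illustrate a different invariant, here is the classic longest-substring-without-repeats, where the window maintains a set of seen characters instead of a sum. Same shape as the function above:

```python
def longest_unique_substring(s: str) -> int:
    seen = set()   # invariant: characters currently inside the window
    left = 0
    best = 0
    for right, ch in enumerate(s):
        # Shrink from the left until ch is no longer a duplicate.
        while ch in seen:
            seen.remove(s[left])
            left += 1
        seen.add(ch)
        best = max(best, right - left + 1)
    return best

# longest_unique_substring("abcabcbb") -> 3  ("abc")
```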
Prefix sums & running stats
For "many range queries over a fixed array" — O(n) preprocessing, O(1) per query. Senior-DS variant: rolling counts, rolling sums for time-series.
```python
def make_prefix(arr):
    prefix = [0] * (len(arr) + 1)
    for i, v in enumerate(arr):
        prefix[i + 1] = prefix[i] + v
    return prefix

prefix = make_prefix([3, 1, 4, 1, 5, 9, 2, 6])

def range_sum(l, r):  # inclusive
    return prefix[r + 1] - prefix[l]
```
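The rolling-sum variant follows directly: a fixed window of size w is just the difference of two prefix entries, so each position costs O(1). A sketch:

```python
def rolling_sums(arr: list[int], w: int) -> list[int]:
    # prefix[i] = sum of arr[:i], same construction as make_prefix
    prefix = [0] * (len(arr) + 1)
    for i, v in enumerate(arr):
        prefix[i + 1] = prefix[i] + v
    # One O(1) subtraction per window position.
    return [prefix[i + w] - prefix[i] for i in range(len(arr) - w + 1)]

# rolling_sums([3, 1, 4, 1, 5], 3) -> [8, 6, 10]
```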
The famous DS interview problem "subarray sum equals k" is a prefix-sum + hash-map trick:
```python
def subarray_sum_eq_k(nums: list[int], k: int) -> int:
    seen = {0: 1}  # prefix sum → count
    running = 0
    answer = 0
    for n in nums:
        running += n
        answer += seen.get(running - k, 0)
        seen[running] = seen.get(running, 0) + 1
    return answer
```
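A good interview habit is to sanity-check the clever version against a brute force on small inputs. A hedged O(n²) reference that should agree with the hash-map version:

```python
def subarray_sum_eq_k_brute(nums: list[int], k: int) -> int:
    # O(n^2) reference: try every (start, end) pair.
    n = len(nums)
    count = 0
    for i in range(n):
        running = 0
        for j in range(i, n):
            running += nums[j]
            if running == k:
                count += 1
    return count

# e.g. [1, 1, 1] with k=2 -> 2, matching the O(n) version
```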
Pandas vs SQL
Common live-coding question: "do this in pandas." The trick is recognizing that most DS-flavored questions are SQL with extra steps. Map each operation:
| SQL | Pandas |
|---|---|
| SELECT col FROM t | df['col'] |
| WHERE col > 5 | df[df['col'] > 5] |
| GROUP BY a, b | df.groupby(['a', 'b']) |
| ORDER BY a DESC | df.sort_values('a', ascending=False) |
| JOIN ... ON ... | df1.merge(df2, on='id', how='left') |
| ROW_NUMBER() OVER (PARTITION BY ...) | df.groupby('a').cumcount() + 1 |
| LAG / LEAD | df.groupby('a')['v'].shift(1) |
| SUM() OVER (...) | df.groupby('a')['v'].cumsum() |
df.groupby('a').apply(fn) is convenient but slow on large data. Prefer the built-in aggregations (.sum(), .agg({'col': 'mean'})) when possible. If you need .apply, say "this would not scale on 100M rows — at scale I'd push it to SQL or vectorize."
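To make the mapping concrete, here is a sketch translating one SQL query, ROW_NUMBER() OVER (PARTITION BY user ORDER BY spend DESC) filtered to rn = 1, into pandas on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({
    "user":  ["a", "a", "b", "b", "b"],
    "spend": [50, 10, 30, 70, 20],
})

# ORDER BY inside the window: sort by user, then spend descending.
out = df.sort_values(["user", "spend"], ascending=[True, False]).copy()

# ROW_NUMBER() OVER (PARTITION BY user): cumcount within each group.
out["rn"] = out.groupby("user").cumcount() + 1

# WHERE rn = 1: each user's top spend.
top = out[out["rn"] == 1]
# top spends: a -> 50, b -> 70
```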
Vectorization
Loops over numpy arrays or pandas series are slow. Vectorized operations are 10–100× faster. Two patterns:
Replace a loop with elementwise ops
```python
import numpy as np

prices = np.array([100, 200, 150, 175])
discounts = np.array([0.1, 0.15, 0.2, 0.05])

# Slow
sale = [p * (1 - d) for p, d in zip(prices, discounts)]

# Vectorized — 10–100x faster
sale = prices * (1 - discounts)
```
Use np.where instead of if/else
```python
df['tier'] = np.where(df['spend'] > 100, 'high',
             np.where(df['spend'] > 10, 'mid', 'low'))
```
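Nested np.where gets unwieldy past two or three tiers; np.select takes parallel lists of conditions and choices and is easier to read. A sketch on a hypothetical spend array:

```python
import numpy as np

spend = np.array([250, 50, 3, 120])

# Conditions are checked in order; first match wins.
conditions = [spend > 100, spend > 10]
choices = ["high", "mid"]
tier = np.select(conditions, choices, default="low")
# ['high', 'mid', 'low', 'high']
```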
Big-O cheat sheet
| Operation | Complexity |
|---|---|
| Dict / set lookup, insert | O(1) amortized |
| Sort | O(n log n) |
| Linear scan | O(n) |
| Two nested loops | O(n²) |
| Heap push/pop | O(log n) |
| Top-k via heap | O(n log k) |
| in on a list | O(n) |
| in on a set / dict | O(1) |
| String concat in a loop | O(n²) in Python — use list + join |
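The last row deserves one line of code: each += on a string copies everything accumulated so far, so collect pieces in a list and join once at the end.

```python
def render_csv_row(values: list[int]) -> str:
    # Anti-pattern: s += str(v) + "," in a loop is O(n^2),
    # because each += copies the whole string built so far.
    # Linear version: build pieces, join once.
    return ",".join(str(v) for v in values)

# render_csv_row([1, 2, 3]) -> "1,2,3"
```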
Live-coding tips
- Clarify before coding. Restate the problem. Ask: input format? Edge cases (empty input, ties, NULLs)? Expected output format? Allowed time/space?
- State your approach before writing code. "I'm going to use a dict to count, then sort by count. O(n log n)." Get a nod first.
- Write tests / examples next. Even 2–3 examples manually traced. Catches off-by-ones early.
- Code, narrating. Talk through each line. Silence in a live-coding round is bad.
- Trace through your own example after coding. Don't ask the interviewer if it's right — verify yourself.
- Handle edge cases explicitly: empty input, single-element input, ties.
- Discuss tradeoffs: "I could use a hash map for O(n) time at O(n) space, or sort for O(n log n) at O(1) space." Pick one and defend.
- When stuck, narrate the stuck: "I'm thinking about how to handle the tie-breaking case…" The interviewer will often nudge.
Senior DS candidates ask one more clarifying question than mid-level. They say "if this were a one-off analysis I'd do X, but if this needed to run daily at scale I'd do Y" — without being asked. They tell the interviewer when their solution is good enough vs when they'd want to iterate further.