Coding & SQL Problems
Drillable problems calibrated to product-DS loops — Python data manipulation problems first, then SQL puzzles. Multiple approaches per problem; drill mode hides solutions.
A · Python data-manipulation problems
P1. Top-K active users from an event stream
Given a list of {user_id, event_at, event_type} records, return the top K users by event count in the last 7 days.
Show solution
from collections import Counter
from datetime import datetime, timedelta
import heapq

def top_k_active(events, k, now=None):
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=7)
    counts = Counter(
        e["user_id"] for e in events if e["event_at"] >= cutoff
    )
    # heapq.nlargest is O(n log k), better than a full sort when k << n
    return heapq.nlargest(k, counts.items(), key=lambda x: x[1])
Tradeoff: O(n log k) vs full sort O(n log n). For k=10 over 10M events, roughly 7× faster. Edge cases: ties. nlargest behaves like a stable sort, so tied users come out in first-seen order; if the question requires an alphabetical tiebreak, use nsmallest with a composite key (see the sketch below), since nlargest on (-count, user_id) would invert the ordering and select the smallest counts.
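A minimal sketch of the alphabetical-tiebreak variant (the helper name is mine; counts is the Counter built above):
import heapq

def top_k_alpha_ties(counts, k):
    # nsmallest on (-count, user_id) means: largest count first,
    # then lexicographically smallest user_id among ties
    return heapq.nsmallest(k, counts.items(), key=lambda x: (-x[1], x[0]))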
P2. Sessionize an event stream
Given events sorted by user_id, event_at, group them into sessions where consecutive events more than 30 minutes apart start a new session. Return (user_id, session_id, n_events) tuples.
Show solution
from datetime import timedelta

GAP = timedelta(minutes=30)

def sessionize(events):
    results = []
    if not events:
        return results
    prev_user = None
    prev_time = None
    session_id = 0
    count = 0
    for e in events:
        new_session = (
            prev_user != e["user_id"]
            or (e["event_at"] - prev_time) > GAP
        )
        if new_session:
            if prev_user is not None:
                results.append((prev_user, session_id, count))
            session_id = session_id + 1 if prev_user == e["user_id"] else 1
            count = 0
        count += 1
        prev_user = e["user_id"]
        prev_time = e["event_at"]
    # flush last session
    results.append((prev_user, session_id, count))
    return results
Trap: remembering to flush the final session after the loop ends. Edge cases: empty input, single-event stream, all events from one user vs many users.
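A quick sanity check on a hypothetical two-user stream (timestamps and IDs are made up):
from datetime import datetime

stream = [  # sorted by user_id, event_at, as the problem requires
    {"user_id": "a", "event_at": datetime(2024, 1, 1, 9, 0)},
    {"user_id": "a", "event_at": datetime(2024, 1, 1, 9, 10)},  # 10 min gap: same session
    {"user_id": "a", "event_at": datetime(2024, 1, 1, 10, 0)},  # 50 min gap: new session
    {"user_id": "b", "event_at": datetime(2024, 1, 1, 9, 5)},   # new user: session ids restart at 1
]
assert sessionize(stream) == [("a", 1, 2), ("a", 2, 1), ("b", 1, 1)]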
P3. Subarray sum equals K (the classic)
Given an integer array and target K, return the number of contiguous subarrays whose sum equals K.
Show solution
Naive O(n²): two loops over (i, j), sum and check. Fine for n ≤ 10k.
Optimal O(n) — prefix sum + hashmap:
def subarray_sum_eq_k(nums, k):
    seen = {0: 1}  # prefix-sum value -> number of times seen
    running = 0
    answer = 0
    for n in nums:
        running += n
        # If running - k has been seen, there is an earlier prefix
        # whose removal leaves a subarray summing to k
        answer += seen.get(running - k, 0)
        seen[running] = seen.get(running, 0) + 1
    return answer
Edge case: negative numbers — the hashmap handles them; binary search on a sorted prefix array would not. Negative numbers are the standard interview twist.
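A one-line check on the negative-number case (example data is mine):
# zero-sum subarrays: [1, -1], [-1, 1], [1, -1] again, and the whole array
assert subarray_sum_eq_k([1, -1, 1, -1], 0) == 4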
P4. Compute retention by cohort
Given user signup dates and a list of activity events (user_id, event_date), compute weekly retention: for each cohort (signup week), what fraction returned in weeks 1, 2, 3, 4?
Show solution
from collections import defaultdict
from datetime import timedelta

def cohort_retention(signups, events, max_weeks=4):
    # signups: dict user_id -> signup_date
    # events: list of (user_id, event_date)
    cohort_size = defaultdict(int)
    retained = defaultdict(set)
    for uid, signup in signups.items():
        # weekday() is 0 on Monday, so this snaps to the Monday of the signup week
        cohort_week = signup - timedelta(days=signup.weekday())
        cohort_size[cohort_week] += 1
    for uid, ev_date in events:
        signup = signups.get(uid)
        if not signup:
            continue
        weeks_since = (ev_date - signup).days // 7
        if 1 <= weeks_since <= max_weeks:
            cohort_week = signup - timedelta(days=signup.weekday())
            retained[(cohort_week, weeks_since)].add(uid)
    return {
        (cw, wk): len(retained[(cw, wk)]) / cohort_size[cw]
        for (cw, wk) in retained
    }
Note: the set is needed because a user can have many events in week N but should count only once. Common bug: tallying a count instead of collecting a set (worked example below).
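A tiny worked example (dates and IDs are hypothetical; 2024-01-01 is a Monday):
from datetime import date

signups = {"u1": date(2024, 1, 1), "u2": date(2024, 1, 2)}  # both in the Jan 1 cohort
events = [
    ("u1", date(2024, 1, 9)),   # week 1
    ("u1", date(2024, 1, 10)),  # week 1 again: must not double-count
    ("u2", date(2024, 1, 20)),  # week 2
]
result = cohort_retention(signups, events)
assert result[(date(2024, 1, 1), 1)] == 0.5  # 1 of 2 users returned in week 1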
P5. Detect anomalies in a daily metric
Given a list of (date, value) tuples, flag days where the value is more than 3 standard deviations from the rolling 28-day mean (ignoring the day itself).
Show solution
import numpy as np

def flag_anomalies(series, window=28, k=3):
    """series: list of (date, value) sorted by date"""
    values = np.array([v for _, v in series])
    out = []
    for i in range(len(values)):
        lo = max(0, i - window)
        history = values[lo:i]  # excludes day i itself
        if len(history) < 7:  # not enough history
            out.append(False)
            continue
        mu, sigma = history.mean(), history.std(ddof=1)
        out.append(abs(values[i] - mu) > k * sigma)
    return list(zip([d for d, _ in series], out))
Discussion points (you score well by raising these unprompted): 3σ on a rolling mean is a brittle definition of anomaly; a real product would use robust statistics (MAD), seasonal decomposition (STL, Prophet), or a control chart with separate weekday/weekend baselines. State this as a follow-up direction; a MAD-based sketch follows.
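A minimal sketch of the MAD-based variant mentioned above (function name is mine; 1.4826 is the standard consistency constant under normality):
import numpy as np

def flag_anomalies_mad(series, window=28, k=3):
    """Robust variant: median / MAD instead of mean / std."""
    values = np.array([v for _, v in series])
    out = []
    for i in range(len(values)):
        history = values[max(0, i - window):i]
        if len(history) < 7:
            out.append(False)
            continue
        med = np.median(history)
        # 1.4826 * MAD estimates sigma under normality; outliers inside the
        # history window no longer drag the threshold itself
        mad_sigma = 1.4826 * np.median(np.abs(history - med))
        out.append(bool(mad_sigma > 0 and abs(values[i] - med) > k * mad_sigma))
    return list(zip([d for d, _ in series], out))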
P6. Two-sum / pair-sum
Classic warmup. Given an array of ints and a target T, return indices of two elements summing to T. Then: how do you generalize to "all pairs that sum to T"?
Show solution
def two_sum(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return (seen[target - n], i)
        seen[n] = i
    return None

# All pairs
def all_pairs(nums, target):
    seen = {}
    out = []
    for i, n in enumerate(nums):
        for j in seen.get(target - n, []):
            out.append((j, i))
        seen.setdefault(n, []).append(i)
    return out
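A quick check that duplicates pair correctly (example data is mine):
# 4 = 2 + 2 uses the duplicate pair (indices 1, 2); 4 = 1 + 3 uses (0, 3)
assert all_pairs([1, 2, 2, 3], 4) == [(1, 2), (0, 3)]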
B · SQL puzzles
P7. N-th highest salary per department
Given employees(emp_id, dept_id, salary), return the 2nd highest salary per department. Discuss tie behavior.
Show solution
-- DENSE_RANK so ties share a rank: "second-highest distinct salary"
SELECT dept_id, salary
FROM (
    SELECT
        dept_id,
        salary,
        DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rk
    FROM employees
) t
WHERE rk = 2;
The discussion that wins: "Does 'second-highest' mean the second distinct value, or the second row when ties are broken arbitrarily? DENSE_RANK gives the second distinct salary. ROW_NUMBER would give the second-ranked row, breaking ties in an undefined order. I'd pick DENSE_RANK and confirm with the interviewer." The ROW_NUMBER variant is written out below for contrast.
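For contrast, the ROW_NUMBER variant (a sketch over the same table; tie-breaking left unspecified):
-- Second row per department after arbitrary tie-breaking; usually NOT what
-- "second-highest salary" means, but worth writing out to show the contrast
SELECT dept_id, salary
FROM (
    SELECT
        dept_id,
        salary,
        ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
    FROM employees
) t
WHERE rn = 2;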
P8. Funnel conversion with time constraints
Given events(user_id, event, event_at), compute the conversion rate from signup → upload → publish, where each step must occur within 7 days of the previous step.
Show solution
WITH first_signup AS (
    SELECT user_id, MIN(event_at) AS signup_at
    FROM events
    WHERE event = 'signup'
    GROUP BY user_id
),
first_upload AS (
    SELECT s.user_id, s.signup_at, MIN(e.event_at) AS upload_at
    FROM first_signup s
    JOIN events e
      ON e.user_id = s.user_id
     AND e.event = 'upload'
     AND e.event_at BETWEEN s.signup_at AND s.signup_at + INTERVAL '7 days'
    GROUP BY s.user_id, s.signup_at
),
first_publish AS (
    SELECT u.user_id, u.signup_at, u.upload_at, MIN(e.event_at) AS publish_at
    FROM first_upload u
    JOIN events e
      ON e.user_id = u.user_id
     AND e.event = 'publish'
     AND e.event_at BETWEEN u.upload_at AND u.upload_at + INTERVAL '7 days'
    GROUP BY u.user_id, u.signup_at, u.upload_at
)
SELECT
    (SELECT COUNT(*) FROM first_signup) AS signups,
    (SELECT COUNT(*) FROM first_upload) AS uploads,
    (SELECT COUNT(*) FROM first_publish) AS publishes;
The trap: "first" matters. Without MIN() at each step, rows multiply when users have many uploads or publishes. Note the 7-day constraint lives in the JOIN, not in a later filter: the JOIN restricts which events are eligible before MIN() picks the first one, whereas filtering after the fact would drop a user whose first-ever upload fell outside the window even if a later upload qualified. A follow-up that turns the counts into rates is sketched below.
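One way to report rates (a sketch, assuming the final SELECT above is wrapped as a CTE I've called funnel; NULLIF guards against division by zero):
-- assuming the previous SELECT is wrapped as a CTE named funnel
SELECT
    uploads * 1.0 / NULLIF(signups, 0) AS signup_to_upload,
    publishes * 1.0 / NULLIF(uploads, 0) AS upload_to_publish,
    publishes * 1.0 / NULLIF(signups, 0) AS signup_to_publish
FROM funnel;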
P9. Gaps and islands — consecutive active weeks
Given weekly_active(user_id, week), find for each user the longest consecutive-week active streak.
Show solution
WITH grouped AS (
    SELECT
        user_id,
        week,
        -- tabibitosan trick: weeks since epoch minus rank within user
        -- is constant within a consecutive run.
        -- DATE '1970-01-05' is a Monday, so week boundaries align;
        -- this DATE_DIFF form is DuckDB-style and varies by dialect
        DATE_DIFF('week', DATE '1970-01-05', week) -
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY week) AS streak_key
    FROM weekly_active
),
streaks AS (
    SELECT
        user_id,
        streak_key,
        COUNT(*) AS streak_weeks
    FROM grouped
    GROUP BY user_id, streak_key
)
SELECT user_id, MAX(streak_weeks) AS longest_streak
FROM streaks
GROUP BY user_id;
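The streak_key expression is the dialect-sensitive part. A Postgres-flavored equivalent (a sketch, assuming week is stored as the Monday date of each week):
-- Postgres has no DATE_DIFF; date subtraction yields integer days
(week - DATE '1970-01-05') / 7 -
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY week) AS streak_key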
P10. CTR with delta-method SE for a ratio metric
Given impressions(user_id, n_imps, n_clicks) for treatment and control, compute the pooled CTR per group and a delta-method standard error for the lift.
Show solution
Pooled CTR is straightforward; the standard error is the trap. Per-user CTR averaged across users isn't the same as pooled, and a naive two-proportion test understates variance because both numerator and denominator are user-level random variables.
import numpy as np

def ctr_with_se(imps, clicks):
    """imps, clicks: per-user impression and click counts (arrays) for one group."""
    n = len(imps)
    mean_i = imps.mean()
    mean_c = clicks.mean()
    var_i = imps.var(ddof=1)
    var_c = clicks.var(ddof=1)
    cov_ic = np.cov(imps, clicks, ddof=1)[0, 1]
    ctr = mean_c / mean_i
    # Delta-method variance of a ratio of means:
    # Var(R) ~= (var_c - 2*R*cov + R^2*var_i) / (n * mean_i^2)
    var_ctr = (var_c / mean_i**2
               - 2 * ctr * cov_ic / mean_i**2
               + ctr**2 * var_i / mean_i**2) / n
    return ctr, np.sqrt(var_ctr)
Discussion: the SE above is per group; for the lift between independent treatment and control groups, the variance of the difference is the sum of the two group variances. Bootstrap is the practical alternative if the math is fuzzy in the room; you can defend it with "I'd bootstrap to be safe, but the delta method gives a closed-form SE." A user-level bootstrap sketch follows.
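A minimal user-level bootstrap sketch for the lift (function and variable names are mine; resample users, not impressions, to respect the unit of randomization):
import numpy as np

def bootstrap_lift_se(imps_t, clicks_t, imps_c, clicks_c, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    lifts = np.empty(n_boot)
    for b in range(n_boot):
        # resample users with replacement, independently per group
        it = rng.integers(0, len(imps_t), len(imps_t))
        ic = rng.integers(0, len(imps_c), len(imps_c))
        ctr_t = clicks_t[it].sum() / imps_t[it].sum()  # pooled CTR, treatment
        ctr_c = clicks_c[ic].sum() / imps_c[ic].sum()  # pooled CTR, control
        lifts[b] = ctr_t - ctr_c
    return lifts.std(ddof=1)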
Drill protocol
1. Enable drill mode (top of page) so solutions are hidden.
2. Read each problem; give yourself 12 minutes to attempt it on paper or in a notebook.
3. Talk through your approach out loud: actually out loud, not in your head.
4. Reveal the solution and compare your approach to the reference.
5. Mark "practiced."
6. Aim to clear 6 problems before any onsite.