Drill — Solutions Hidden by Default

Coding & SQL Problems

Drillable problems calibrated to product-DS loops — Python data manipulation problems first, then SQL puzzles. Multiple approaches per problem; drill mode hides solutions.

🎯 10 problems ⏱ ~15 min each 📊 Progress saved

A · Python data-manipulation problems

P1. Top-K active users from an event stream

Given a list of {user_id, event_at, event_type} records, return the top K users by event count in the last 7 days.

Show solution
top_k_active.py
from collections import Counter
from datetime import datetime, timedelta
import heapq

def top_k_active(events, k, now=None):
    now = now or datetime.utcnow()  # naive UTC; deprecated on 3.12+, prefer datetime.now(timezone.utc) if event_at is tz-aware
    cutoff = now - timedelta(days=7)
    counts = Counter(
        e["user_id"] for e in events if e["event_at"] >= cutoff
    )
    # heapq.nlargest is O(n log k), better than full sort when k << n
    return heapq.nlargest(k, counts.items(), key=lambda x: x[1])

Tradeoff: O(n log k) vs full sort O(n log n). For k=10 over 10M events, the log factor is roughly 7× smaller. Edge cases: ties (nlargest is stable, so tied users come out in first-seen order; if the question requires an alphabetical tiebreak, use sorted(counts.items(), key=lambda x: (-x[1], x[0]))[:k] or heapq.nsmallest with that key. Passing that key to nlargest would invert the count order.)
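
For instance, the alphabetical-tiebreak variant on toy counts (note it is heapq.nsmallest that pairs with the negated-count key):

```python
import heapq
from collections import Counter

counts = Counter({"bob": 3, "ann": 3, "cal": 1})
# Ascending order on (-count, user_id) puts high counts first and breaks ties
# alphabetically, so the *smallest* keys are what we want: nsmallest, not nlargest.
top2 = heapq.nsmallest(2, counts.items(), key=lambda x: (-x[1], x[0]))
assert top2 == [("ann", 3), ("bob", 3)]
```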

P2. Sessionize an event stream

Given events sorted by user_id, event_at, group them into sessions where consecutive events more than 30 minutes apart start a new session. Return (user_id, session_id, n_events) tuples.

Show solution
sessionize.py
from datetime import timedelta

GAP = timedelta(minutes=30)

def sessionize(events):
    results = []
    if not events:
        return results
    prev_user = None
    prev_time = None
    session_id = 0
    count = 0
    for e in events:
        new_session = (
            prev_user != e["user_id"]
            or (e["event_at"] - prev_time) > GAP
        )
        if new_session:
            if prev_user is not None:
                results.append((prev_user, session_id, count))
            session_id = session_id + 1 if prev_user == e["user_id"] else 1
            count = 0
        count += 1
        prev_user = e["user_id"]
        prev_time = e["event_at"]
    # flush last session
    results.append((prev_user, session_id, count))
    return results

Trap: remembering to flush the final session after the loop ends. Edge cases: empty input, single-event stream, all events from one user vs many users.
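
An alternative shape some candidates find safer: group by user with itertools.groupby so the per-user flush lives inside the inner loop. A sketch under the same sortedness assumption (sessionize_groupby is not the reference solution above):

```python
from datetime import datetime, timedelta
from itertools import groupby

GAP = timedelta(minutes=30)

def sessionize_groupby(events):
    # Same contract as sessionize(); still assumes events sorted by (user_id, event_at).
    out = []
    for uid, evs in groupby(events, key=lambda e: e["user_id"]):
        session_id, count, prev = 1, 0, None
        for e in evs:
            if prev is not None and e["event_at"] - prev > GAP:
                out.append((uid, session_id, count))
                session_id += 1
                count = 0
            count += 1
            prev = e["event_at"]
        out.append((uid, session_id, count))  # flush is per user, harder to forget
    return out

t0 = datetime(2024, 1, 1)
evs = [
    {"user_id": "u1", "event_at": t0},
    {"user_id": "u1", "event_at": t0 + timedelta(minutes=10)},
    {"user_id": "u1", "event_at": t0 + timedelta(minutes=50)},  # 40-min gap: new session
    {"user_id": "u2", "event_at": t0},
]
assert sessionize_groupby(evs) == [("u1", 1, 2), ("u1", 2, 1), ("u2", 1, 1)]
```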

P3. Subarray sum equals K (the classical one)

Given an integer array and target K, return the number of contiguous subarrays whose sum equals K.

Show solution

Naive O(n²): two loops over (i, j), sum and check. Fine for n ≤ 10k.
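
Spelled out, the naive version also doubles as a brute-force oracle when testing the optimal one (subarray_sum_naive is an illustrative name, not from the original):

```python
def subarray_sum_naive(nums, k):
    # O(n^2): grow each start index rightward, checking the running sum.
    count = 0
    for i in range(len(nums)):
        s = 0
        for j in range(i, len(nums)):
            s += nums[j]
            if s == k:
                count += 1
    return count

assert subarray_sum_naive([1, -1, 1], 1) == 3   # [1], [1,-1,1], trailing [1]
assert subarray_sum_naive([1, 1, 1], 2) == 2
```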

Optimal O(n) — prefix sum + hashmap:

subarray_sum.py
def subarray_sum_eq_k(nums, k):
    seen = {0: 1}      # prefix sum value → number of times seen
    running = 0
    answer = 0
    for n in nums:
        running += n
        # If running - k has been seen, that means there's a prefix
        # ending earlier whose removal leaves a subarray summing to k
        answer += seen.get(running - k, 0)
        seen[running] = seen.get(running, 0) + 1
    return answer

Edge case: negative numbers — the hashmap handles them; binary search on a sorted prefix array would not. Negative numbers are the standard interview twist.

P4. Compute retention by cohort

Given user signup dates and a list of activity events (user_id, event_date), compute weekly retention: for each cohort (signup week), what fraction returned in weeks 1, 2, 3, 4?

Show solution
cohort_retention.py
from collections import defaultdict
from datetime import timedelta

def cohort_retention(signups, events, max_weeks=4):
    # signups: dict user_id -> signup_date
    # events:  list of (user_id, event_date)
    cohort_size = defaultdict(int)
    retained   = defaultdict(set)
    for uid, signup in signups.items():
        cohort_week = signup - timedelta(days=signup.weekday())
        cohort_size[cohort_week] += 1
    for uid, ev_date in events:
        signup = signups.get(uid)
        if not signup:
            continue
        weeks_since = (ev_date - signup).days // 7
        if 1 <= weeks_since <= max_weeks:
            cohort_week = signup - timedelta(days=signup.weekday())
            retained[(cohort_week, weeks_since)].add(uid)
    return {
        (cw, wk): len(retained[(cw, wk)]) / cohort_size[cw]
        for (cw, wk) in retained
    }

Note: the set is needed because a user can have many events in week N but we count them once. Common bug: using a count instead of a set.
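
A toy repro of that bug, on hypothetical data where one user fires three events in the same week:

```python
from collections import defaultdict

events = [("u1", 1), ("u1", 1), ("u1", 1), ("u2", 1)]  # (user_id, weeks_since)

buggy = defaultdict(int)
correct = defaultdict(set)
for uid, wk in events:
    buggy[wk] += 1         # counts events: inflates retention
    correct[wk].add(uid)   # counts distinct users

assert buggy[1] == 4
assert len(correct[1]) == 2
```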

P5. Detect anomalies in a daily metric

Given a list of (date, value) tuples, flag days where the value is more than 3 standard deviations from the rolling 28-day mean (ignoring the day itself).

Show solution
anomaly_3sigma.py
import numpy as np

def flag_anomalies(series, window=28, k=3):
    """series: list of (date, value) sorted by date"""
    values = np.array([v for _, v in series])
    out = []
    for i in range(len(values)):
        lo = max(0, i - window)
        history = values[lo:i]
        if len(history) < 7:  # not enough history
            out.append(False)
            continue
        mu, sigma = history.mean(), history.std(ddof=1)
        out.append(abs(values[i] - mu) > k * sigma)
    return list(zip([d for d, _ in series], out))

Discussion points (score well by raising these unprompted): 3σ is a narrow definition of anomaly; a real product would use robust stats (MAD), seasonal decomposition (STL, Prophet), or a control chart with separate weekday/weekend baselines. State this as a follow-up direction.
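
As one of those follow-ups, a minimal MAD-based sketch (flag_anomalies_mad is illustrative; the 1.4826 factor rescales MAD to σ under Gaussian data):

```python
import numpy as np

def flag_anomalies_mad(values, window=28, k=3, min_history=7):
    # Robust variant of the 3-sigma rule: median/MAD instead of mean/std.
    values = np.asarray(values, dtype=float)
    flags = []
    for i in range(len(values)):
        history = values[max(0, i - window):i]   # excludes the day itself
        if len(history) < min_history:
            flags.append(False)
            continue
        med = np.median(history)
        mad = 1.4826 * np.median(np.abs(history - med))  # ~sigma for Gaussian data
        if mad == 0:
            flags.append(bool(values[i] != med))  # flat history: flag any change
        else:
            flags.append(bool(abs(values[i] - med) > k * mad))
    return flags

flags = flag_anomalies_mad([10.0] * 10 + [100.0])
assert flags == [False] * 10 + [True]
```

Unlike the mean/std version, a single earlier spike in the window barely moves the median or the MAD, so it doesn't mask later anomalies.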

P6. Two-sum / pair-sum

Classic warmup. Given an array of ints and a target T, return indices of two elements summing to T. Then: how do you generalize to "all pairs that sum to T"?

Show solution
two_sum.py
def two_sum(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return (seen[target - n], i)
        seen[n] = i
    return None

# All pairs
def all_pairs(nums, target):
    seen = {}
    out = []
    for i, n in enumerate(nums):
        for j in seen.get(target - n, []):
            out.append((j, i))
        seen.setdefault(n, []).append(i)
    return out

B · SQL puzzles

P7. N-th highest salary per department

Given employees(emp_id, dept_id, salary), return the 2nd highest salary per department. Discuss tie behavior.

Show solution
nth_salary.sql
-- DENSE_RANK so ties share a rank: "second-highest distinct salary"
SELECT dept_id, salary
FROM (
  SELECT
    dept_id,
    salary,
    DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rk
  FROM employees
) t
WHERE rk = 2;

The discussion that wins: "Does 'second-highest' mean the second distinct value, or the second row when ties are broken arbitrarily? DENSE_RANK gives the second distinct salary. ROW_NUMBER would give the second-ranked row, breaking ties in undefined order. I'd pick DENSE_RANK and confirm with the interviewer."
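
The difference is quick to demonstrate in-memory with sqlite3 (window functions need SQLite ≥ 3.25; the three-row table is a made-up example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INT, dept_id INT, salary INT)")
# Dept 1 has a tie at the top: 100, 100, 90.
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, 1, 100), (2, 1, 100), (3, 1, 90)])

dense = conn.execute("""
    SELECT salary FROM (
      SELECT salary,
             DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rk
      FROM employees
    ) WHERE rk = 2
""").fetchall()

rownum = conn.execute("""
    SELECT salary FROM (
      SELECT salary,
             ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rk
      FROM employees
    ) WHERE rk = 2
""").fetchall()

assert dense == [(90,)]    # second distinct salary
assert rownum == [(100,)]  # second-ranked row: one of the tied 100s
```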

P8. Funnel conversion with time constraints

Given events(user_id, event, event_at), compute the conversion rate from signup → upload → publish, where each step must occur within 7 days of the previous step.

Show solution
funnel_timed.sql
WITH first_signup AS (
  SELECT user_id, MIN(event_at) AS signup_at
  FROM events WHERE event = 'signup'
  GROUP BY user_id
),
first_upload AS (
  SELECT s.user_id, s.signup_at, MIN(e.event_at) AS upload_at
  FROM first_signup s
  JOIN events e
    ON e.user_id = s.user_id
   AND e.event = 'upload'
   AND e.event_at BETWEEN s.signup_at AND s.signup_at + INTERVAL '7 days'
  GROUP BY s.user_id, s.signup_at
),
first_publish AS (
  SELECT u.user_id, u.signup_at, u.upload_at, MIN(e.event_at) AS publish_at
  FROM first_upload u
  JOIN events e
    ON e.user_id = u.user_id
   AND e.event = 'publish'
   AND e.event_at BETWEEN u.upload_at AND u.upload_at + INTERVAL '7 days'
  GROUP BY u.user_id, u.signup_at, u.upload_at
)
SELECT
  (SELECT COUNT(*) FROM first_signup)  AS signups,
  (SELECT COUNT(*) FROM first_upload)  AS uploads,
  (SELECT COUNT(*) FROM first_publish) AS publishes;

The trap: "first" matters. Without MIN() at each step, you'd multiply rows when users have many uploads or publishes. Note also that MIN() is taken over only the qualifying events, because the 7-day window sits in the JOIN condition: each step finds the first event inside the window. Computing the global first upload and then filtering would wrongly drop a user whose earliest upload falls outside the window (say, before signup) even though a later upload qualifies.
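
A pure-Python cross-check of the same funnel logic, useful when SQL isn't executable in the room (funnel_counts is a sketch mirroring the CTEs, not part of the original solution):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)

def funnel_counts(events):
    # events: list of (user_id, event, event_at) tuples
    by_user = {}
    for uid, ev, at in events:
        by_user.setdefault(uid, []).append((ev, at))
    signups = uploads = publishes = 0
    for evs in by_user.values():
        signup_at = min((at for ev, at in evs if ev == "signup"), default=None)
        if signup_at is None:
            continue
        signups += 1
        # "first" upload = earliest upload inside the window, like MIN() in the CTE
        upload_at = min((at for ev, at in evs
                         if ev == "upload" and signup_at <= at <= signup_at + WINDOW),
                        default=None)
        if upload_at is None:
            continue
        uploads += 1
        if any(ev == "publish" and upload_at <= at <= upload_at + WINDOW
               for ev, at in evs):
            publishes += 1
    return signups, uploads, publishes

day = lambda n: datetime(2024, 1, 1) + timedelta(days=n)
events = [
    ("u1", "signup", day(0)), ("u1", "upload", day(3)), ("u1", "publish", day(5)),
    ("u2", "signup", day(0)), ("u2", "upload", day(10)),   # upload too late
    ("u3", "signup", day(0)), ("u3", "upload", day(2)), ("u3", "publish", day(12)),
]
assert funnel_counts(events) == (3, 2, 1)
```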

P9. Gaps and islands — consecutive active weeks

Given weekly_active(user_id, week), find for each user the longest consecutive-week active streak.

Show solution
longest_streak.sql
WITH grouped AS (
  SELECT
    user_id,
    week,
    -- tabibitosan trick: weeks since epoch minus rank within user
    -- is constant within a consecutive run
    DATE_DIFF('week', DATE '1970-01-05', week) -
      ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY week) AS streak_key
  FROM weekly_active
),
streaks AS (
  SELECT
    user_id,
    streak_key,
    COUNT(*) AS streak_weeks
  FROM grouped
  GROUP BY user_id, streak_key
)
SELECT user_id, MAX(streak_weeks) AS longest_streak
FROM streaks
GROUP BY user_id;
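
The same tabibitosan idea fits in a few lines of Python: within a consecutive run, value minus position is constant, so it serves as a grouping key (toy week numbers below):

```python
from itertools import groupby

weeks = [1, 2, 3, 7, 8, 12]   # one user's active week numbers, sorted
# Within a consecutive run, week - position is constant: use it as the group key.
runs = [len(list(g))
        for _, g in groupby(enumerate(weeks), key=lambda t: t[1] - t[0])]
assert runs == [3, 2, 1]
assert max(runs) == 3   # longest streak
```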

P10. CTR with delta-method SE for a ratio metric

Given impressions(user_id, n_imps, n_clicks) for treatment and control, compute the pooled CTR per group and a delta-method standard error for the lift.

Show solution

Pooled CTR is straightforward; the standard error is the trap. Per-user CTR averaged across users isn't the same as pooled, and a naive two-proportion test understates variance because both numerator and denominator are user-level random variables.

delta_method_ctr.py
import numpy as np

def ctr_with_se(imps, clicks):
    n = len(imps)
    mean_i = imps.mean()
    mean_c = clicks.mean()
    var_i = imps.var(ddof=1)
    var_c = clicks.var(ddof=1)
    cov_ic = np.cov(imps, clicks, ddof=1)[0, 1]
    ctr = mean_c / mean_i
    var_ctr = (var_c / mean_i**2
               - 2 * ctr * cov_ic / mean_i**2
               + ctr**2 * var_i / mean_i**2) / n
    return ctr, np.sqrt(var_ctr)

Discussion: bootstrap is the practical alternative if the math is fuzzy in the room — you can defend it with "I'd bootstrap to be safe, but the delta method gives a closed-form SE."
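
A sketch of that bootstrap, resampling at the user level so numerator and denominator move together (ctr_se_bootstrap and its defaults are illustrative):

```python
import numpy as np

def ctr_se_bootstrap(imps, clicks, n_boot=2000, seed=0):
    # Resample users with replacement; SE is the spread of pooled CTR across resamples.
    rng = np.random.default_rng(seed)
    n = len(imps)
    idx = rng.integers(0, n, size=(n_boot, n))   # n_boot resampled user sets
    ctrs = clicks[idx].sum(axis=1) / imps[idx].sum(axis=1)
    return clicks.sum() / imps.sum(), ctrs.std(ddof=1)

imps = np.array([10.0, 20.0, 30.0, 40.0])
clicks = np.array([1.0, 2.0, 3.0, 4.0])   # every user's CTR is exactly 0.1
ctr, se = ctr_se_bootstrap(imps, clicks, n_boot=200)
assert abs(ctr - 0.1) < 1e-12 and se == 0.0  # no between-user variation, so SE is 0
```

With heterogeneous per-user CTRs the bootstrap SE becomes positive, and for large user counts it should land close to the delta-method value above.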

Drill protocol

How to use this page

1. Enable drill mode (top of page) so solutions are hidden.
2. Read each problem and give yourself 12 minutes to attempt it on paper or in a notebook.
3. Talk through your approach out loud (actually out loud, not in your head).
4. Reveal the solution and compare your approach to the reference.
5. Mark "practiced."
6. Aim to clear 6 problems before any onsite.