Coding & SQL Problems
Drillable problems calibrated to product-DS loops — Python data manipulation problems first, then SQL puzzles. Multiple approaches per problem; drill mode hides solutions.
A · Python data-manipulation problems
P1. Top-K active users from an event stream
Given a list of {user_id, event_at, event_type} records, return the top K users by event count in the last 7 days.
Show solution
from collections import Counter
from datetime import datetime, timedelta
import heapq

def top_k_active(events, k, now=None):
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=7)
    counts = Counter(
        e["user_id"] for e in events if e["event_at"] >= cutoff
    )
    # heapq.nlargest is O(n log k), better than a full sort when k << n
    return heapq.nlargest(k, counts.items(), key=lambda x: x[1])
Tradeoff: O(n log k) vs full sort O(n log n). For k=10 over 10M events, roughly 7× faster. Edge cases: ties. nlargest behaves like a stable sort, so tied users come out in first-seen order; if the question requires an alphabetical tiebreak, use nsmallest with a composite key (see the sketch below), since nlargest on (-count, user_id) would invert the ordering and select the smallest counts.
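A minimal sketch of the alphabetical-tiebreak variant (the helper name is mine; counts is the Counter built above):
import heapq

def top_k_alpha_ties(counts, k):
    # nsmallest on (-count, user_id) means: largest count first,
    # then lexicographically smallest user_id among ties
    return heapq.nsmallest(k, counts.items(), key=lambda x: (-x[1], x[0]))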
P2. Sessionize an event stream
Given events sorted by user_id, event_at, group them into sessions where consecutive events more than 30 minutes apart start a new session. Return (user_id, session_id, n_events) tuples.
Show solution
from datetime import timedelta

GAP = timedelta(minutes=30)

def sessionize(events):
    results = []
    if not events:
        return results
    prev_user = None
    prev_time = None
    session_id = 0
    count = 0
    for e in events:
        new_session = (
            prev_user != e["user_id"]
            or (e["event_at"] - prev_time) > GAP
        )
        if new_session:
            if prev_user is not None:
                results.append((prev_user, session_id, count))
            session_id = session_id + 1 if prev_user == e["user_id"] else 1
            count = 0
        count += 1
        prev_user = e["user_id"]
        prev_time = e["event_at"]
    # flush last session
    results.append((prev_user, session_id, count))
    return results
Trap: remembering to flush the final session after the loop ends. Edge cases: empty input, single-event stream, all events from one user vs many users.
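A quick sanity check on a hypothetical two-user stream (timestamps and IDs are made up):
from datetime import datetime

stream = [  # sorted by user_id, event_at, as the problem requires
    {"user_id": "a", "event_at": datetime(2024, 1, 1, 9, 0)},
    {"user_id": "a", "event_at": datetime(2024, 1, 1, 9, 10)},  # 10 min gap: same session
    {"user_id": "a", "event_at": datetime(2024, 1, 1, 10, 0)},  # 50 min gap: new session
    {"user_id": "b", "event_at": datetime(2024, 1, 1, 9, 5)},   # new user: session ids restart at 1
]
assert sessionize(stream) == [("a", 1, 2), ("a", 2, 1), ("b", 1, 1)]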
P3. Subarray sum equals K (the classic)
Given an integer array and target K, return the number of contiguous subarrays whose sum equals K.
Show solution
Naive O(n²): two loops over (i, j), sum and check. Fine for n ≤ 10k.
Optimal O(n) — prefix sum + hashmap:
def subarray_sum_eq_k(nums, k):
    seen = {0: 1}  # prefix-sum value -> number of times seen
    running = 0
    answer = 0
    for n in nums:
        running += n
        # If running - k has been seen, there is an earlier prefix
        # whose removal leaves a subarray summing to k
        answer += seen.get(running - k, 0)
        seen[running] = seen.get(running, 0) + 1
    return answer
Edge case: negative numbers — the hashmap handles them; binary search on a sorted prefix array would not. Negative numbers are the standard interview twist.
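A one-line check on the negative-number case (example data is mine):
# zero-sum subarrays: [1, -1], [-1, 1], [1, -1] again, and the whole array
assert subarray_sum_eq_k([1, -1, 1, -1], 0) == 4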
P4. Compute retention by cohort
Given user signup dates and a list of activity events (user_id, event_date), compute weekly retention: for each cohort (signup week), what fraction returned in weeks 1, 2, 3, 4?
Show solution
from collections import defaultdict
from datetime import timedelta

def cohort_retention(signups, events, max_weeks=4):
    # signups: dict user_id -> signup_date
    # events: list of (user_id, event_date)
    cohort_size = defaultdict(int)
    retained = defaultdict(set)
    for uid, signup in signups.items():
        # weekday() is 0 on Monday, so this snaps to the Monday of the signup week
        cohort_week = signup - timedelta(days=signup.weekday())
        cohort_size[cohort_week] += 1
    for uid, ev_date in events:
        signup = signups.get(uid)
        if not signup:
            continue
        weeks_since = (ev_date - signup).days // 7
        if 1 <= weeks_since <= max_weeks:
            cohort_week = signup - timedelta(days=signup.weekday())
            retained[(cohort_week, weeks_since)].add(uid)
    return {
        (cw, wk): len(retained[(cw, wk)]) / cohort_size[cw]
        for (cw, wk) in retained
    }
Note: the set is needed because a user can have many events in week N but should count only once. Common bug: tallying a count instead of collecting a set (worked example below).
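A tiny worked example (dates and IDs are hypothetical; 2024-01-01 is a Monday):
from datetime import date

signups = {"u1": date(2024, 1, 1), "u2": date(2024, 1, 2)}  # both in the Jan 1 cohort
events = [
    ("u1", date(2024, 1, 9)),   # week 1
    ("u1", date(2024, 1, 10)),  # week 1 again: must not double-count
    ("u2", date(2024, 1, 20)),  # week 2
]
result = cohort_retention(signups, events)
assert result[(date(2024, 1, 1), 1)] == 0.5  # 1 of 2 users returned in week 1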
P5. Detect anomalies in a daily metric
Given a list of (date, value) tuples, flag days where the value is more than 3 standard deviations from the rolling 28-day mean (ignoring the day itself).
Show solution
import numpy as np

def flag_anomalies(series, window=28, k=3):
    """series: list of (date, value) sorted by date"""
    values = np.array([v for _, v in series])
    out = []
    for i in range(len(values)):
        lo = max(0, i - window)
        history = values[lo:i]  # excludes day i itself
        if len(history) < 7:  # not enough history
            out.append(False)
            continue
        mu, sigma = history.mean(), history.std(ddof=1)
        out.append(abs(values[i] - mu) > k * sigma)
    return list(zip([d for d, _ in series], out))
Discussion points (you score well by raising these unprompted): 3σ on a rolling mean is a brittle definition of anomaly; a real product would use robust statistics (MAD), seasonal decomposition (STL, Prophet), or a control chart with separate weekday/weekend baselines. State this as a follow-up direction; a MAD-based sketch follows.
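A minimal sketch of the MAD-based variant mentioned above (function name is mine; 1.4826 is the standard consistency constant under normality):
import numpy as np

def flag_anomalies_mad(series, window=28, k=3):
    """Robust variant: median / MAD instead of mean / std."""
    values = np.array([v for _, v in series])
    out = []
    for i in range(len(values)):
        history = values[max(0, i - window):i]
        if len(history) < 7:
            out.append(False)
            continue
        med = np.median(history)
        # 1.4826 * MAD estimates sigma under normality; outliers inside the
        # history window no longer drag the threshold itself
        mad_sigma = 1.4826 * np.median(np.abs(history - med))
        out.append(bool(mad_sigma > 0 and abs(values[i] - med) > k * mad_sigma))
    return list(zip([d for d, _ in series], out))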
P6. Two-sum / pair-sum
Classic warmup. Given an array of ints and a target T, return indices of two elements summing to T. Then: how do you generalize to "all pairs that sum to T"?
Show solution
def two_sum(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        if target - n in seen:
            return (seen[target - n], i)
        seen[n] = i
    return None

# All pairs
def all_pairs(nums, target):
    seen = {}
    out = []
    for i, n in enumerate(nums):
        for j in seen.get(target - n, []):
            out.append((j, i))
        seen.setdefault(n, []).append(i)
    return out
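A quick check that duplicates pair correctly (example data is mine):
# 4 = 2 + 2 uses the duplicate pair (indices 1, 2); 4 = 1 + 3 uses (0, 3)
assert all_pairs([1, 2, 2, 3], 4) == [(1, 2), (0, 3)]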
B · SQL puzzles
P7. N-th highest salary per department
Given employees(emp_id, dept_id, salary), return the 2nd highest salary per department. Discuss tie behavior.
Show solution
-- DENSE_RANK so ties share a rank: "second-highest distinct salary"
SELECT dept_id, salary
FROM (
    SELECT
        dept_id,
        salary,
        DENSE_RANK() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rk
    FROM employees
) t
WHERE rk = 2;
The discussion that wins: "Does 'second-highest' mean the second distinct value, or the second row when ties are broken arbitrarily? DENSE_RANK gives the second distinct salary. ROW_NUMBER would give the second-ranked row, breaking ties in an undefined order. I'd pick DENSE_RANK and confirm with the interviewer." The ROW_NUMBER variant is written out below for contrast.
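For contrast, the ROW_NUMBER variant (a sketch over the same table; tie-breaking left unspecified):
-- Second row per department after arbitrary tie-breaking; usually NOT what
-- "second-highest salary" means, but worth writing out to show the contrast
SELECT dept_id, salary
FROM (
    SELECT
        dept_id,
        salary,
        ROW_NUMBER() OVER (PARTITION BY dept_id ORDER BY salary DESC) AS rn
    FROM employees
) t
WHERE rn = 2;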
P8. Funnel conversion with time constraints
Given events(user_id, event, event_at), compute the conversion rate from signup → upload → publish, where each step must occur within 7 days of the previous step.
Show solution
WITH first_signup AS (
    SELECT user_id, MIN(event_at) AS signup_at
    FROM events
    WHERE event = 'signup'
    GROUP BY user_id
),
first_upload AS (
    SELECT s.user_id, s.signup_at, MIN(e.event_at) AS upload_at
    FROM first_signup s
    JOIN events e
      ON e.user_id = s.user_id
     AND e.event = 'upload'
     AND e.event_at BETWEEN s.signup_at AND s.signup_at + INTERVAL '7 days'
    GROUP BY s.user_id, s.signup_at
),
first_publish AS (
    SELECT u.user_id, u.signup_at, u.upload_at, MIN(e.event_at) AS publish_at
    FROM first_upload u
    JOIN events e
      ON e.user_id = u.user_id
     AND e.event = 'publish'
     AND e.event_at BETWEEN u.upload_at AND u.upload_at + INTERVAL '7 days'
    GROUP BY u.user_id, u.signup_at, u.upload_at
)
SELECT
    (SELECT COUNT(*) FROM first_signup) AS signups,
    (SELECT COUNT(*) FROM first_upload) AS uploads,
    (SELECT COUNT(*) FROM first_publish) AS publishes;
The trap: "first" matters. Without MIN() at each step, rows multiply when users have many uploads or publishes. Note the 7-day constraint lives in the JOIN, not in a later filter: the JOIN restricts which events are eligible before MIN() picks the first one, whereas filtering after the fact would drop a user whose first-ever upload fell outside the window even if a later upload qualified. A follow-up that turns the counts into rates is sketched below.
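One way to report rates (a sketch, assuming the final SELECT above is wrapped as a CTE I've called funnel; NULLIF guards against division by zero):
-- assuming the previous SELECT is wrapped as a CTE named funnel
SELECT
    uploads * 1.0 / NULLIF(signups, 0) AS signup_to_upload,
    publishes * 1.0 / NULLIF(uploads, 0) AS upload_to_publish,
    publishes * 1.0 / NULLIF(signups, 0) AS signup_to_publish
FROM funnel;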
P9. Gaps and islands — consecutive active weeks
Given weekly_active(user_id, week), find for each user the longest consecutive-week active streak.
Show solution
WITH grouped AS (
    SELECT
        user_id,
        week,
        -- tabibitosan trick: weeks since epoch minus rank within user
        -- is constant within a consecutive run.
        -- DATE '1970-01-05' is a Monday, so week boundaries align;
        -- this DATE_DIFF form is DuckDB-style and varies by dialect
        DATE_DIFF('week', DATE '1970-01-05', week) -
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY week) AS streak_key
    FROM weekly_active
),
streaks AS (
    SELECT
        user_id,
        streak_key,
        COUNT(*) AS streak_weeks
    FROM grouped
    GROUP BY user_id, streak_key
)
SELECT user_id, MAX(streak_weeks) AS longest_streak
FROM streaks
GROUP BY user_id;
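The streak_key expression is the dialect-sensitive part. A Postgres-flavored equivalent (a sketch, assuming week is stored as the Monday date of each week):
-- Postgres has no DATE_DIFF; date subtraction yields integer days
(week - DATE '1970-01-05') / 7 -
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY week) AS streak_key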
P10. CTR with delta-method SE for a ratio metric
Given impressions(user_id, n_imps, n_clicks) for treatment and control, compute the pooled CTR per group and a delta-method standard error for the lift.
Show solution
Pooled CTR is straightforward; the standard error is the trap. Per-user CTR averaged across users isn't the same as pooled, and a naive two-proportion test understates variance because both numerator and denominator are user-level random variables.
import numpy as np

def ctr_with_se(imps, clicks):
    """imps, clicks: per-user impression and click counts (arrays) for one group."""
    n = len(imps)
    mean_i = imps.mean()
    mean_c = clicks.mean()
    var_i = imps.var(ddof=1)
    var_c = clicks.var(ddof=1)
    cov_ic = np.cov(imps, clicks, ddof=1)[0, 1]
    ctr = mean_c / mean_i
    # Delta-method variance of a ratio of means:
    # Var(R) ~= (var_c - 2*R*cov + R^2*var_i) / (n * mean_i^2)
    var_ctr = (var_c / mean_i**2
               - 2 * ctr * cov_ic / mean_i**2
               + ctr**2 * var_i / mean_i**2) / n
    return ctr, np.sqrt(var_ctr)
Discussion: the SE above is per group; for the lift between independent treatment and control groups, the variance of the difference is the sum of the two group variances. Bootstrap is the practical alternative if the math is fuzzy in the room; you can defend it with "I'd bootstrap to be safe, but the delta method gives a closed-form SE." A user-level bootstrap sketch follows.
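A minimal user-level bootstrap sketch for the lift (function and variable names are mine; resample users, not impressions, to respect the unit of randomization):
import numpy as np

def bootstrap_lift_se(imps_t, clicks_t, imps_c, clicks_c, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    lifts = np.empty(n_boot)
    for b in range(n_boot):
        # resample users with replacement, independently per group
        it = rng.integers(0, len(imps_t), len(imps_t))
        ic = rng.integers(0, len(imps_c), len(imps_c))
        ctr_t = clicks_t[it].sum() / imps_t[it].sum()  # pooled CTR, treatment
        ctr_c = clicks_c[ic].sum() / imps_c[ic].sum()  # pooled CTR, control
        lifts[b] = ctr_t - ctr_c
    return lifts.std(ddof=1)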
Drill protocol
1. Enable drill mode (top of page) so solutions are hidden.
2. Read each problem; give yourself 12 minutes to attempt it on paper or in a notebook.
3. Talk through your approach out loud: actually out loud, not in your head.
4. Reveal the solution and compare your approach to the reference.
5. Mark "practiced."
6. Aim to clear 6 problems before any onsite.