SQL for Product Analytics
The dialect of SQL that product DS loops actually test — window functions, cohorts, funnels, sessionization, gaps-and-islands, and the gotchas interviewers love.
Why SQL is the gatekeeper
Both target JDs name SQL explicitly. Founding-DS JDs want "expert SQL." Leadership-role JDs want "strong command of SQL, Python, and Git." For product DS loops at AI companies, the SQL screen is almost always the first technical filter — usually one or two timed problems where you live-share a screen, get a schema, and write queries against it on the spot.
The bar isn't "knows SQL." The bar is writes correct SQL fast, with appropriate use of window functions, without flinching at gotchas. Interviewers are looking for three signals:
- You reach for window functions before self-joins, when both work.
- You handle NULLs and dedupe deliberately, not by accident.
- You can read a schema and ask one or two sharp clarifying questions before writing anything.
If you can answer the "find the second-highest salary per department" problem with a window function in under 3 minutes, including the NULL edge case, you're at the entry bar. If you can also write a 7-day rolling DAU with a partitioned window, you're at the comfortable bar. The senior-IC bar is comfortable plus experimentation techniques like CUPED variance reduction (which we'll meet in 05-advanced-experimentation).
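A minimal sketch of that entry-bar problem, assuming a hypothetical employees(department_id, salary) table:
-- Second-highest distinct salary per department.
-- Filtering NULL salaries first handles the edge case; DENSE_RANK
-- (not ROW_NUMBER) makes "second-highest" mean the second distinct value.
SELECT department_id, salary AS second_highest_salary
FROM (
  SELECT
    department_id,
    salary,
    DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk
  FROM employees
  WHERE salary IS NOT NULL
) t
WHERE rnk = 2;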
Window functions
Window functions are the dividing line between "SQL fluent" and "SQL strong" for analytics. Everything else in this chapter rests on them. The four families to know cold:
Ranking: ROW_NUMBER, RANK, DENSE_RANK
Pick exactly one row per group ("most recent order per user"), or rank within groups ("top 3 products by revenue per category"). The difference matters in tie cases:
-- Most recent order per user
SELECT user_id, order_id, ordered_at
FROM (
SELECT
user_id,
order_id,
ordered_at,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ordered_at DESC) AS rn
FROM orders
) t
WHERE rn = 1;
-- ROW_NUMBER: ties broken arbitrarily, exactly 1 row per group
-- RANK: ties share a rank, next rank skips (1,2,2,4)
-- DENSE_RANK: ties share a rank, next rank doesn't skip (1,2,2,3)
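To see the ranking choice matter, here is the "top 3 products by revenue per category" variant, assuming a hypothetical product_revenue(category, product_id, revenue) table:
SELECT category, product_id, revenue
FROM (
  SELECT
    category,
    product_id,
    revenue,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk
  FROM product_revenue
) t
WHERE rnk <= 3;
-- DENSE_RANK returns more than 3 rows when revenues tie at the cutoff;
-- ROW_NUMBER would return exactly 3, dropping tied products arbitrarily.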
Offset: LAG, LEAD
Compare a row to the previous or next row in its partition. Use for diffs, retention, churn detection, time-between-events.
SELECT
user_id,
ordered_at,
LAG(ordered_at) OVER (PARTITION BY user_id ORDER BY ordered_at) AS prev_order_at,
DATE_DIFF('day',
LAG(ordered_at) OVER (PARTITION BY user_id ORDER BY ordered_at),
ordered_at) AS days_since_prev
FROM orders;
Aggregate windows: SUM, AVG, COUNT with OVER
Running totals, moving averages, share-of-total. Frame clauses matter — default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when you specify ORDER BY, which trips people up.
WITH daily AS (
SELECT
event_date,
COUNT(DISTINCT user_id) AS dau
FROM events
GROUP BY event_date
)
SELECT
event_date,
dau,
AVG(dau) OVER (
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS dau_7d_avg
FROM daily
ORDER BY event_date;
Use ROWS BETWEEN for moving averages, not RANGE. RANGE frames are defined by value: all peer rows sharing the same ORDER BY value enter the frame together, and interval offsets like RANGE BETWEEN INTERVAL '6 days' PRECEDING aren't supported in many dialects. Note the flip side of ROWS: it counts physical rows, so if the daily table has gaps, "6 PRECEDING" means the last 7 observed days, not the last 7 calendar days. This is a classic interviewer gotcha.
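The share-of-total pattern mentioned above needs no ORDER BY at all: omit it and the frame becomes the whole partition. A sketch, assuming the same hypothetical product_revenue table as earlier:
SELECT
  category,
  product_id,
  -- No ORDER BY in the window, so SUM runs over the entire partition.
  revenue * 1.0 / SUM(revenue) OVER (PARTITION BY category) AS category_share
FROM product_revenue;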
First/last: FIRST_VALUE, LAST_VALUE, NTH_VALUE
Carry a value from the first/last row of a partition into every other row. Useful for "what was the user's first plan?" or "what was the last touchpoint before conversion?"
SELECT
user_id,
event_at,
plan,
FIRST_VALUE(plan) OVER (
PARTITION BY user_id
ORDER BY event_at
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_plan
FROM plan_changes;
LAST_VALUE with the default frame returns the current row's value, not the last row in the partition — because the default frame ends at CURRENT ROW. Always specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING when you want the partition-wide last.
Cohort retention
"Of users who signed up in week N, how many came back in week N+k?" — the question every product DS gets asked. Two versions: classical retention (came back at all by week N+k) and n-day return (came back exactly on day N+k).
WITH cohorts AS (
SELECT
user_id,
DATE_TRUNC('week', signup_at) AS cohort_week
FROM users
),
activity AS (
SELECT DISTINCT
user_id,
DATE_TRUNC('week', event_at) AS activity_week
FROM events
)
SELECT
c.cohort_week,
DATE_DIFF('week', c.cohort_week, a.activity_week) AS weeks_since_signup,
COUNT(DISTINCT c.user_id) AS retained_users,
COUNT(DISTINCT c.user_id) * 1.0 /
FIRST_VALUE(COUNT(DISTINCT c.user_id)) OVER (
PARTITION BY c.cohort_week
ORDER BY DATE_DIFF('week', c.cohort_week, a.activity_week)
) AS retention_rate
FROM cohorts c
LEFT JOIN activity a
ON c.user_id = a.user_id
AND a.activity_week >= c.cohort_week
GROUP BY 1, 2
ORDER BY 1, 2;
The denominator trick (FIRST_VALUE over the cohort partition) avoids the second pass that beginner versions of this query use. One caveat: it uses the week-0 row as the denominator, so it measures retention relative to week-0 actives; if not every user is active in their signup week, compute cohort sizes in a separate CTE, as in the two-pass version below.
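For contrast, a sketch of that two-pass version, which makes the denominator the full cohort size rather than week-0 actives:
WITH cohorts AS (
  SELECT user_id, DATE_TRUNC('week', signup_at) AS cohort_week
  FROM users
),
activity AS (
  SELECT DISTINCT user_id, DATE_TRUNC('week', event_at) AS activity_week
  FROM events
),
-- The "second pass": count each cohort once, up front.
cohort_sizes AS (
  SELECT cohort_week, COUNT(*) AS cohort_size
  FROM cohorts
  GROUP BY cohort_week
)
SELECT
  c.cohort_week,
  DATE_DIFF('week', c.cohort_week, a.activity_week) AS weeks_since_signup,
  COUNT(DISTINCT c.user_id) * 1.0 / MAX(s.cohort_size) AS retention_rate
FROM cohorts c
JOIN activity a
  ON a.user_id = c.user_id
 AND a.activity_week >= c.cohort_week
JOIN cohort_sizes s
  ON s.cohort_week = c.cohort_week
GROUP BY 1, 2
ORDER BY 1, 2;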
Funnels
"Of users who hit step 1, how many reached step N?" — with an ordering constraint (step N must come after step N-1 in time, for the same user). The naive self-join answer is O(steps²); the right answer uses a window function:
WITH user_steps AS (
SELECT
user_id,
MAX(CASE WHEN event = 'signup' THEN event_at END) AS signup_at,
MAX(CASE WHEN event = 'upload' THEN event_at END) AS upload_at,
MAX(CASE WHEN event = 'publish' THEN event_at END) AS publish_at
FROM events
GROUP BY user_id
)
SELECT
COUNT(signup_at) AS signup_count, -- not COUNT(*): users with no signup would inflate the top of the funnel
COUNT(CASE WHEN upload_at > signup_at THEN 1 END) AS upload_count,
COUNT(CASE WHEN publish_at > upload_at AND upload_at > signup_at THEN 1 END) AS publish_count
FROM user_steps;
Two refinements interviewers will probe:
- First occurrence vs any occurrence. If a user signs up, uploads, signs up again (via referral?), then uploads — does the second upload count? Usually you want MIN() for signup and the next-step MIN() that follows it. Be explicit about the rule.
- Time-window funnels. "Reached step N within 7 days of step N-1" requires carrying the previous step's timestamp (via a window function or a first-occurrence CTE) and filtering on the diff; see the sketch after this list.
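A MIN-based sketch of the time-window variant for a two-step funnel, using the same events table (for deeper funnels you'd carry timestamps forward step by step):
WITH first_signup AS (
  SELECT user_id, MIN(event_at) AS signup_at
  FROM events
  WHERE event = 'signup'
  GROUP BY user_id
),
first_upload_7d AS (
  -- Earliest upload that follows the signup by at most 7 days.
  SELECT f.user_id, MIN(e.event_at) AS upload_at
  FROM first_signup f
  JOIN events e
    ON e.user_id = f.user_id
   AND e.event = 'upload'
   AND e.event_at > f.signup_at
   AND DATE_DIFF('day', f.signup_at, e.event_at) <= 7
  GROUP BY f.user_id
)
SELECT
  (SELECT COUNT(*) FROM first_signup) AS signup_count,
  (SELECT COUNT(*) FROM first_upload_7d) AS upload_within_7d_count;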
Sessionization
Group consecutive events into sessions based on a gap rule ("if more than 30 minutes elapsed, start a new session"). The canonical pattern uses LAG + a running sum on the "is this a new session" flag:
WITH gapped AS (
SELECT
user_id,
event_at,
CASE
WHEN DATE_DIFF('minute',
LAG(event_at) OVER (PARTITION BY user_id ORDER BY event_at),
event_at) > 30
OR LAG(event_at) OVER (PARTITION BY user_id ORDER BY event_at) IS NULL
THEN 1 ELSE 0
END AS is_new_session
FROM events
),
sessioned AS (
SELECT
user_id,
event_at,
SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_at) AS session_id
FROM gapped
)
SELECT
user_id,
session_id,
MIN(event_at) AS session_start,
MAX(event_at) AS session_end,
COUNT(*) AS event_count
FROM sessioned
GROUP BY 1, 2;
Gaps and islands
The classical pattern: given a sequence of events, group consecutive runs (subscribers continuously active week-over-week, days a user logged in consecutively). The trick is the "tabibitosan" identity — subtract a running rank from the date to collapse runs into the same group:
WITH daily AS (
SELECT DISTINCT user_id, DATE_TRUNC('day', event_at) AS login_day
FROM events
),
grouped AS (
SELECT
user_id,
login_day,
DATE_DIFF('day', DATE '1970-01-01', login_day) -
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) AS streak_key
FROM daily
)
SELECT
user_id,
MIN(login_day) AS streak_start,
MAX(login_day) AS streak_end,
COUNT(*) AS streak_days
FROM grouped
GROUP BY user_id, streak_key
ORDER BY user_id, streak_start;
Read carefully: the streak_key is "date number minus rank". For consecutive dates, both increase by 1, so the difference is constant within a run — and changes only at the gap. Same trick with generate_series for "find missing dates."
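A sketch of the missing-dates flip side in Postgres-flavored SQL (generate_series and the date range are assumptions; warehouse dialects have their own date-spine generators):
-- Calendar days in January 2024 with no events at all.
SELECT gs.cal_day
FROM generate_series(DATE '2024-01-01', DATE '2024-01-31', INTERVAL '1 day') AS gs(cal_day)
LEFT JOIN (
  SELECT DISTINCT DATE_TRUNC('day', event_at) AS event_day
  FROM events
) e
  ON e.event_day = gs.cal_day
WHERE e.event_day IS NULL;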
The gotchas interviewers love
If they're testing you, they're testing these:
NULL semantics
- COUNT(*) counts rows regardless of NULLs; COUNT(col) skips NULLs, and COUNT(DISTINCT col) ignores them too.
- col != 'x' excludes NULL rows. If you want them, write col IS DISTINCT FROM 'x' (or COALESCE(col, '') != 'x' in older dialects).
- NOT IN (subquery) returns zero rows if any subquery row is NULL. Use NOT EXISTS.
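A minimal demonstration of the NOT IN trap, assuming hypothetical users and banned_users tables:
-- Returns zero rows if banned_users.user_id contains even one NULL:
SELECT u.user_id
FROM users u
WHERE u.user_id NOT IN (SELECT b.user_id FROM banned_users b);

-- Safe equivalent: a correlated existence check.
SELECT u.user_id
FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM banned_users b WHERE b.user_id = u.user_id
);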
Dedupe before you join
If your join key isn't unique on both sides, you'll silently multiply rows. Dedupe with ROW_NUMBER() or DISTINCT explicitly. A sanity check interviewers love: "what's the row count of left, right, joined? Does the joined count make sense?"
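A sketch of the dedupe-then-join pattern, keeping one row per key before joining (the user_profiles table and its columns are illustrative):
WITH latest_profile AS (
  SELECT user_id, plan
  FROM (
    SELECT
      user_id,
      plan,
      ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
    FROM user_profiles
  ) t
  WHERE rn = 1  -- exactly one profile row per user survives
)
SELECT o.order_id, o.user_id, lp.plan
FROM orders o
LEFT JOIN latest_profile lp
  ON lp.user_id = o.user_id;
-- Sanity check: the result's row count should equal the row count of orders.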
Time-zone and DST
Always say which time zone you're truncating to. DATE_TRUNC('day', event_at) in UTC vs the user's local zone produces different daily metrics. The right answer is usually "use the customer's reporting zone, store UTC in the warehouse, convert in the metric query."
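A sketch of the convert-in-the-metric-query pattern, in Postgres-flavored syntax (the reporting zone, and the assumption that event_at is stored as a naive UTC timestamp, are illustrative; BigQuery and Snowflake spell the conversion differently):
SELECT
  DATE_TRUNC('day',
    event_at AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York'
  ) AS local_day,
  COUNT(DISTINCT user_id) AS dau
FROM events
GROUP BY 1
ORDER BY 1;
-- The double AT TIME ZONE first tags the naive timestamp as UTC,
-- then converts it to the reporting zone before truncating.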
Group-by-with-CASE for pivots
Long-to-wide reshapes use SUM(CASE WHEN ... THEN ... END). Don't reach for PIVOT unless your dialect supports it and your interviewer specifically asks.
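A sketch of the long-to-wide reshape, assuming a hypothetical platform column on events:
SELECT
  DATE_TRUNC('day', event_at) AS event_day,
  SUM(CASE WHEN platform = 'ios' THEN 1 ELSE 0 END) AS ios_events,
  SUM(CASE WHEN platform = 'android' THEN 1 ELSE 0 END) AS android_events,
  SUM(CASE WHEN platform = 'web' THEN 1 ELSE 0 END) AS web_events
FROM events
GROUP BY 1
ORDER BY 1;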
Interview probes
The questions you should be able to answer cold:
Probe 1: "Difference between ROW_NUMBER, RANK, DENSE_RANK?"
All three rank within a partition by an order. ROW_NUMBER always assigns 1,2,3 — ties broken arbitrarily. RANK gives ties the same rank, then skips (1,2,2,4). DENSE_RANK gives ties the same rank and doesn't skip (1,2,2,3). Use ROW_NUMBER for "exactly one row per group"; use RANK or DENSE_RANK when ties are real and you want them surfaced.
Probe 2: "Why does NOT IN (SELECT ...) sometimes return zero rows?"
If the subquery returns even one NULL, NOT IN can never evaluate to TRUE: the comparison against that NULL yields UNKNOWN, so every row is filtered out. NOT EXISTS does a correlated existence check, where NULLs simply fail to match. Default to NOT EXISTS.
Probe 3: "Why does LAST_VALUE return the current row by default?"
When you specify ORDER BY in a window without an explicit frame, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. LAST_VALUE applies to that frame, so the "last" is the current row. Fix: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
Probe 4: "How would you compute MAU as a 28-day rolling window per day?"
Build the daily distinct-user log, then for each day count distinct users in the prior 28 days. Naive version: self-join. Better version: precompute first-seen / last-seen and use it; or, in dialects that support it, use HyperLogLog-style approximate distinct counts. Be ready to discuss the "I can't COUNT(DISTINCT) in a window frame" constraint and the workarounds.
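A sketch of the self-join version, the answer most loops expect first (interval arithmetic syntax varies by dialect):
WITH user_days AS (
  SELECT DISTINCT user_id, DATE_TRUNC('day', event_at) AS active_day
  FROM events
),
metric_days AS (
  SELECT DISTINCT active_day AS metric_day FROM user_days
)
SELECT
  d.metric_day,
  COUNT(DISTINCT u.user_id) AS rolling_28d_mau
FROM metric_days d
JOIN user_days u
  ON u.active_day >= d.metric_day - INTERVAL '27 days'
 AND u.active_day <= d.metric_day
GROUP BY d.metric_day
ORDER BY d.metric_day;
-- 28 days inclusive of the metric day. The join is O(days x active users),
-- which is exactly why the approximate-distinct workarounds come up.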
Probe 5: "Walk me through a query that's slow. How do you debug it?"
Six-step protocol: (1) check the query plan / EXPLAIN for full scans or hash spills; (2) confirm partition pruning is happening on date-partitioned tables; (3) check skew in any GROUP BY or JOIN keys; (4) confirm join cardinality — Cartesian explosions hide in many-to-many joins; (5) sample the input tables and see if pre-filtering helps; (6) for warehouse SQL, check whether materializing an intermediate CTE as a table speeds it up.