SQL for Product Analytics
The dialect of SQL that product DS loops actually test — window functions, cohorts, funnels, sessionization, gaps-and-islands, and the gotchas interviewers love.
Why SQL is the gatekeeper
Both target JDs name SQL explicitly. Founding-DS JDs want "expert SQL." Leadership-role JDs want "strong command of SQL, Python, and Git." For product DS loops at AI companies, the SQL screen is almost always the first technical filter — usually one or two timed problems where you live-share a screen, get a schema, and write queries against it on the spot.
The bar isn't "knows SQL." The bar is writes correct SQL fast, with appropriate use of window functions, without flinching at gotchas. Interviewers are looking for three signals:
- You reach for window functions before self-joins, when both work.
- You handle NULLs and dedupe deliberately, not by accident.
- You can read a schema and ask one or two sharp clarifying questions before writing anything.
If you can answer the "find the second-highest salary per department" problem with a window function in under 3 minutes, including the NULL edge case, you're at the entry bar. If you can also write a 7-day rolling DAU with a partitioned window, you're at the comfortable bar. The senior-IC bar is comfortable plus experimentation techniques like CUPED variance reduction (which we'll meet in 05-advanced-experimentation).
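A minimal sketch of that entry-bar problem, assuming a hypothetical employees(department_id, salary) table:
-- Second-highest distinct salary per department.
-- Filtering NULL salaries first handles the edge case; DENSE_RANK
-- (not ROW_NUMBER) makes "second-highest" mean the second distinct value.
SELECT department_id, salary AS second_highest_salary
FROM (
  SELECT
    department_id,
    salary,
    DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS rnk
  FROM employees
  WHERE salary IS NOT NULL
) t
WHERE rnk = 2;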
Window functions
Window functions are the dividing line between "SQL fluent" and "SQL strong" for analytics. Everything else in this chapter rests on them. The four families to know cold:
Ranking: ROW_NUMBER, RANK, DENSE_RANK
Pick exactly one row per group ("most recent order per user"), or rank within groups ("top 3 products by revenue per category"). The difference matters in tie cases:
-- Most recent order per user
SELECT user_id, order_id, ordered_at
FROM (
SELECT
user_id,
order_id,
ordered_at,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY ordered_at DESC) AS rn
FROM orders
) t
WHERE rn = 1;
-- ROW_NUMBER: ties broken arbitrarily, exactly 1 row per group
-- RANK: ties share a rank, next rank skips (1,2,2,4)
-- DENSE_RANK: ties share a rank, next rank doesn't skip (1,2,2,3)
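To see the ranking choice matter, here is the "top 3 products by revenue per category" variant, assuming a hypothetical product_revenue(category, product_id, revenue) table:
SELECT category, product_id, revenue
FROM (
  SELECT
    category,
    product_id,
    revenue,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk
  FROM product_revenue
) t
WHERE rnk <= 3;
-- DENSE_RANK returns more than 3 rows when revenues tie at the cutoff;
-- ROW_NUMBER would return exactly 3, dropping tied products arbitrarily.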
Offset: LAG, LEAD
Compare a row to the previous or next row in its partition. Use for diffs, retention, churn detection, time-between-events.
SELECT
user_id,
ordered_at,
LAG(ordered_at) OVER (PARTITION BY user_id ORDER BY ordered_at) AS prev_order_at,
DATE_DIFF('day',
LAG(ordered_at) OVER (PARTITION BY user_id ORDER BY ordered_at),
ordered_at) AS days_since_prev
FROM orders;
Aggregate windows: SUM, AVG, COUNT with OVER
Running totals, moving averages, share-of-total. Frame clauses matter — default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when you specify ORDER BY, which trips people up.
WITH daily AS (
SELECT
event_date,
COUNT(DISTINCT user_id) AS dau
FROM events
GROUP BY event_date
)
SELECT
event_date,
dau,
AVG(dau) OVER (
ORDER BY event_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS dau_7d_avg
FROM daily
ORDER BY event_date;
Use ROWS BETWEEN for moving averages, not RANGE. RANGE frames are defined by value: all peer rows sharing the same ORDER BY value enter the frame together, and interval offsets like RANGE BETWEEN INTERVAL '6 days' PRECEDING aren't supported in many dialects. Note the flip side of ROWS: it counts physical rows, so if the daily table has gaps, "6 PRECEDING" means the last 7 observed days, not the last 7 calendar days. This is a classic interviewer gotcha.
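The share-of-total pattern mentioned above needs no ORDER BY at all: omit it and the frame becomes the whole partition. A sketch, assuming the same hypothetical product_revenue table as earlier:
SELECT
  category,
  product_id,
  -- No ORDER BY in the window, so SUM runs over the entire partition.
  revenue * 1.0 / SUM(revenue) OVER (PARTITION BY category) AS category_share
FROM product_revenue;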
First/last: FIRST_VALUE, LAST_VALUE, NTH_VALUE
Carry a value from the first/last row of a partition into every other row. Useful for "what was the user's first plan?" or "what was the last touchpoint before conversion?"
SELECT
user_id,
event_at,
plan,
FIRST_VALUE(plan) OVER (
PARTITION BY user_id
ORDER BY event_at
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_plan
FROM plan_changes;
LAST_VALUE with the default frame returns the current row's value, not the last row in the partition — because the default frame ends at CURRENT ROW. Always specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING when you want the partition-wide last.
Cohort retention
"Of users who signed up in week N, how many came back in week N+k?" — the question every product DS gets asked. Two versions: classical retention (came back at all by week N+k) and n-day return (came back exactly on day N+k).
WITH cohorts AS (
SELECT
user_id,
DATE_TRUNC('week', signup_at) AS cohort_week
FROM users
),
activity AS (
SELECT DISTINCT
user_id,
DATE_TRUNC('week', event_at) AS activity_week
FROM events
)
SELECT
c.cohort_week,
DATE_DIFF('week', c.cohort_week, a.activity_week) AS weeks_since_signup,
COUNT(DISTINCT c.user_id) AS retained_users,
COUNT(DISTINCT c.user_id) * 1.0 /
FIRST_VALUE(COUNT(DISTINCT c.user_id)) OVER (
PARTITION BY c.cohort_week
ORDER BY DATE_DIFF('week', c.cohort_week, a.activity_week)
) AS retention_rate
FROM cohorts c
LEFT JOIN activity a
ON c.user_id = a.user_id
AND a.activity_week >= c.cohort_week
GROUP BY 1, 2
ORDER BY 1, 2;
The denominator trick (FIRST_VALUE over the cohort partition) avoids the second pass that beginner versions of this query use. One caveat: it uses the week-0 row as the denominator, so it measures retention relative to week-0 actives; if not every user is active in their signup week, compute cohort sizes in a separate CTE, as in the two-pass version below.
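For contrast, a sketch of that two-pass version, which makes the denominator the full cohort size rather than week-0 actives:
WITH cohorts AS (
  SELECT user_id, DATE_TRUNC('week', signup_at) AS cohort_week
  FROM users
),
activity AS (
  SELECT DISTINCT user_id, DATE_TRUNC('week', event_at) AS activity_week
  FROM events
),
-- The "second pass": count each cohort once, up front.
cohort_sizes AS (
  SELECT cohort_week, COUNT(*) AS cohort_size
  FROM cohorts
  GROUP BY cohort_week
)
SELECT
  c.cohort_week,
  DATE_DIFF('week', c.cohort_week, a.activity_week) AS weeks_since_signup,
  COUNT(DISTINCT c.user_id) * 1.0 / MAX(s.cohort_size) AS retention_rate
FROM cohorts c
JOIN activity a
  ON a.user_id = c.user_id
 AND a.activity_week >= c.cohort_week
JOIN cohort_sizes s
  ON s.cohort_week = c.cohort_week
GROUP BY 1, 2
ORDER BY 1, 2;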
Funnels
"Of users who hit step 1, how many reached step N?" — with an ordering constraint (step N must come after step N-1 in time, for the same user). The naive self-join answer is O(steps²); the right answer uses a window function:
WITH user_steps AS (
SELECT
user_id,
MAX(CASE WHEN event = 'signup' THEN event_at END) AS signup_at,
MAX(CASE WHEN event = 'upload' THEN event_at END) AS upload_at,
MAX(CASE WHEN event = 'publish' THEN event_at END) AS publish_at
FROM events
GROUP BY user_id
)
SELECT
COUNT(signup_at) AS signup_count, -- not COUNT(*): users with no signup would inflate the top of the funnel
COUNT(CASE WHEN upload_at > signup_at THEN 1 END) AS upload_count,
COUNT(CASE WHEN publish_at > upload_at AND upload_at > signup_at THEN 1 END) AS publish_count
FROM user_steps;
Two refinements interviewers will probe:
- First occurrence vs any occurrence. If a user signs up, uploads, signs up again (via referral?), then uploads — does the second upload count? Usually you want MIN() for signup and the next-step MIN() that follows it. Be explicit about the rule.
- Time-window funnels. "Reached step N within 7 days of step N-1" requires carrying the previous step's timestamp (via a window function or a first-occurrence CTE) and filtering on the diff; see the sketch after this list.
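A MIN-based sketch of the time-window variant for a two-step funnel, using the same events table (for deeper funnels you'd carry timestamps forward step by step):
WITH first_signup AS (
  SELECT user_id, MIN(event_at) AS signup_at
  FROM events
  WHERE event = 'signup'
  GROUP BY user_id
),
first_upload_7d AS (
  -- Earliest upload that follows the signup by at most 7 days.
  SELECT f.user_id, MIN(e.event_at) AS upload_at
  FROM first_signup f
  JOIN events e
    ON e.user_id = f.user_id
   AND e.event = 'upload'
   AND e.event_at > f.signup_at
   AND DATE_DIFF('day', f.signup_at, e.event_at) <= 7
  GROUP BY f.user_id
)
SELECT
  (SELECT COUNT(*) FROM first_signup) AS signup_count,
  (SELECT COUNT(*) FROM first_upload_7d) AS upload_within_7d_count;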
Sessionization
Group consecutive events into sessions based on a gap rule ("if more than 30 minutes elapsed, start a new session"). The canonical pattern uses LAG + a running sum on the "is this a new session" flag:
WITH gapped AS (
SELECT
user_id,
event_at,
CASE
WHEN DATE_DIFF('minute',
LAG(event_at) OVER (PARTITION BY user_id ORDER BY event_at),
event_at) > 30
OR LAG(event_at) OVER (PARTITION BY user_id ORDER BY event_at) IS NULL
THEN 1 ELSE 0
END AS is_new_session
FROM events
),
sessioned AS (
SELECT
user_id,
event_at,
SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_at) AS session_id
FROM gapped
)
SELECT
user_id,
session_id,
MIN(event_at) AS session_start,
MAX(event_at) AS session_end,
COUNT(*) AS event_count
FROM sessioned
GROUP BY 1, 2;
Gaps and islands
The classical pattern: given a sequence of events, group consecutive runs (subscribers continuously active week-over-week, days a user logged in consecutively). The trick is the "tabibitosan" identity — subtract a running rank from the date to collapse runs into the same group:
WITH daily AS (
SELECT DISTINCT user_id, DATE_TRUNC('day', event_at) AS login_day
FROM events
),
grouped AS (
SELECT
user_id,
login_day,
DATE_DIFF('day', DATE '1970-01-01', login_day) -
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_day) AS streak_key
FROM daily
)
SELECT
user_id,
MIN(login_day) AS streak_start,
MAX(login_day) AS streak_end,
COUNT(*) AS streak_days
FROM grouped
GROUP BY user_id, streak_key
ORDER BY user_id, streak_start;
Read carefully: the streak_key is "date number minus rank". For consecutive dates, both increase by 1, so the difference is constant within a run — and changes only at the gap. Same trick with generate_series for "find missing dates."
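A sketch of the missing-dates flip side in Postgres-flavored SQL (generate_series and the date range are assumptions; warehouse dialects have their own date-spine generators):
-- Calendar days in January 2024 with no events at all.
SELECT gs.cal_day
FROM generate_series(DATE '2024-01-01', DATE '2024-01-31', INTERVAL '1 day') AS gs(cal_day)
LEFT JOIN (
  SELECT DISTINCT DATE_TRUNC('day', event_at) AS event_day
  FROM events
) e
  ON e.event_day = gs.cal_day
WHERE e.event_day IS NULL;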
The gotchas interviewers love
If they're testing you, they're testing these:
NULL semantics
- COUNT(*) counts rows regardless of NULLs; COUNT(col) skips NULLs, and COUNT(DISTINCT col) ignores them too.
- col != 'x' excludes NULL rows. If you want them, write col IS DISTINCT FROM 'x' (or COALESCE(col, '') != 'x' in older dialects).
- NOT IN (subquery) returns zero rows if any subquery row is NULL. Use NOT EXISTS.
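A minimal demonstration of the NOT IN trap, assuming hypothetical users and banned_users tables:
-- Returns zero rows if banned_users.user_id contains even one NULL:
SELECT u.user_id
FROM users u
WHERE u.user_id NOT IN (SELECT b.user_id FROM banned_users b);

-- Safe equivalent: a correlated existence check.
SELECT u.user_id
FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM banned_users b WHERE b.user_id = u.user_id
);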
Dedupe before you join
If your join key isn't unique on both sides, you'll silently multiply rows. Dedupe with ROW_NUMBER() or DISTINCT explicitly. A sanity check interviewers love: "what's the row count of left, right, joined? Does the joined count make sense?"
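A sketch of the dedupe-then-join pattern, keeping one row per key before joining (the user_profiles table and its columns are illustrative):
WITH latest_profile AS (
  SELECT user_id, plan
  FROM (
    SELECT
      user_id,
      plan,
      ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY updated_at DESC) AS rn
    FROM user_profiles
  ) t
  WHERE rn = 1  -- exactly one profile row per user survives
)
SELECT o.order_id, o.user_id, lp.plan
FROM orders o
LEFT JOIN latest_profile lp
  ON lp.user_id = o.user_id;
-- Sanity check: the result's row count should equal the row count of orders.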
Time-zone and DST
Always say which time zone you're truncating to. DATE_TRUNC('day', event_at) in UTC vs the user's local zone produces different daily metrics. The right answer is usually "use the customer's reporting zone, store UTC in the warehouse, convert in the metric query."
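A sketch of the convert-in-the-metric-query pattern, in Postgres-flavored syntax (the reporting zone, and the assumption that event_at is stored as a naive UTC timestamp, are illustrative; BigQuery and Snowflake spell the conversion differently):
SELECT
  DATE_TRUNC('day',
    event_at AT TIME ZONE 'UTC' AT TIME ZONE 'America/New_York'
  ) AS local_day,
  COUNT(DISTINCT user_id) AS dau
FROM events
GROUP BY 1
ORDER BY 1;
-- The double AT TIME ZONE first tags the naive timestamp as UTC,
-- then converts it to the reporting zone before truncating.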
Group-by-with-CASE for pivots
Long-to-wide reshapes use SUM(CASE WHEN ... THEN ... END). Don't reach for PIVOT unless your dialect supports it and your interviewer specifically asks.
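A sketch of the long-to-wide reshape, assuming a hypothetical platform column on events:
SELECT
  DATE_TRUNC('day', event_at) AS event_day,
  SUM(CASE WHEN platform = 'ios' THEN 1 ELSE 0 END) AS ios_events,
  SUM(CASE WHEN platform = 'android' THEN 1 ELSE 0 END) AS android_events,
  SUM(CASE WHEN platform = 'web' THEN 1 ELSE 0 END) AS web_events
FROM events
GROUP BY 1
ORDER BY 1;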
Interview probes
The questions you should be able to answer cold:
Probe 1: "Difference between ROW_NUMBER, RANK, DENSE_RANK?"
All three rank within a partition by an order. ROW_NUMBER always assigns 1,2,3 — ties broken arbitrarily. RANK gives ties the same rank, then skips (1,2,2,4). DENSE_RANK gives ties the same rank and doesn't skip (1,2,2,3). Use ROW_NUMBER for "exactly one row per group"; use RANK or DENSE_RANK when ties are real and you want them surfaced.
Probe 2: "Why does NOT IN (SELECT ...) sometimes return zero rows?"
If the subquery returns even one NULL, NOT IN can never evaluate to TRUE: the comparison against that NULL yields UNKNOWN, so every row is filtered out. NOT EXISTS does a correlated existence check, where NULLs simply fail to match. Default to NOT EXISTS.
Probe 3: "Why does LAST_VALUE return the current row by default?"
When you specify ORDER BY in a window without an explicit frame, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. LAST_VALUE applies to that frame, so the "last" is the current row. Fix: ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
Probe 4: "How would you compute MAU as a 28-day rolling window per day?"
Build the daily distinct-user log, then for each day count distinct users in the prior 28 days. Naive version: self-join. Better version: precompute first-seen / last-seen and use it; or, in dialects that support it, use HyperLogLog-style approximate distinct counts. Be ready to discuss the "I can't COUNT(DISTINCT) in a window frame" constraint and the workarounds.
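A sketch of the self-join version, the answer most loops expect first (interval arithmetic syntax varies by dialect):
WITH user_days AS (
  SELECT DISTINCT user_id, DATE_TRUNC('day', event_at) AS active_day
  FROM events
),
metric_days AS (
  SELECT DISTINCT active_day AS metric_day FROM user_days
)
SELECT
  d.metric_day,
  COUNT(DISTINCT u.user_id) AS rolling_28d_mau
FROM metric_days d
JOIN user_days u
  ON u.active_day >= d.metric_day - INTERVAL '27 days'
 AND u.active_day <= d.metric_day
GROUP BY d.metric_day
ORDER BY d.metric_day;
-- 28 days inclusive of the metric day. The join is O(days x active users),
-- which is exactly why the approximate-distinct workarounds come up.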
Probe 5: "Walk me through a query that's slow. How do you debug it?"
Six-step protocol: (1) check the query plan / EXPLAIN for full scans or hash spills; (2) confirm partition pruning is happening on date-partitioned tables; (3) check skew in any GROUP BY or JOIN keys; (4) confirm join cardinality — Cartesian explosions hide in many-to-many joins; (5) sample the input tables and see if pre-filtering helps; (6) for warehouse SQL, check whether materializing an intermediate CTE as a table speeds it up.