SQL Deep Dive
CTEs, window functions, query plans, anti-patterns. The single most-tested skill in analytics-engineering loops — fluency under pressure separates strong candidates from the rest.
SQL is graded on fluency, not cleverness. You should write correct CTEs and window functions without hesitating. Plan to write SQL out loud on a timer — record yourself if you're alone — until each pattern feels automatic. Then drill 11-sql-problems.
Logical query order — know this cold
SQL is written one way and executed another. The logical order:
1. FROM (and JOINs)
2. WHERE (filters rows BEFORE grouping)
3. GROUP BY
4. HAVING (filters groups AFTER grouping)
5. SELECT (expressions evaluated here — that's why aliases aren't available in WHERE)
6. DISTINCT
7. ORDER BY (aliases ARE available here)
8. LIMIT / OFFSET
This explains common errors interviewers love to catch:
- "Why can't I use a SELECT alias in WHERE?" — because WHERE runs before SELECT.
- "Why is COUNT(*) including NULL?" — because COUNT(*) counts rows, not values. COUNT(col) skips NULLs.
- "Why does HAVING work where WHERE didn't?" — because HAVING runs after grouping; WHERE runs before.
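All three answers can be checked in a scratch database. A quick sketch using Python's built-in sqlite3 (table and data invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("a", 10), ("a", 20), ("b", 5), ("b", None)])

# COUNT(*) counts rows; COUNT(col) skips NULLs
counts = conn.execute("SELECT COUNT(*), COUNT(revenue) FROM orders").fetchone()
print(counts)  # (4, 3)

# WHERE filters rows BEFORE grouping: only the single 20-revenue row survives
where_result = conn.execute("""
    SELECT customer, SUM(revenue) FROM orders
    WHERE revenue >= 20
    GROUP BY customer
""").fetchall()
print(where_result)  # [('a', 20)]

# HAVING filters groups AFTER grouping: a's total (30) passes, b's (5) doesn't
having_result = conn.execute("""
    SELECT customer, SUM(revenue) FROM orders
    GROUP BY customer
    HAVING SUM(revenue) >= 20
""").fetchall()
print(having_result)  # [('a', 30)]
```

Same threshold, different meaning: WHERE compares individual rows, HAVING compares group totals.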
Joins — really know them
Five join types you should be able to draw on a whiteboard and explain in one sentence:
| Join | What it returns | Common pitfall |
|---|---|---|
| INNER JOIN | Only rows that match on both sides | Silent data loss when keys are NULL or unmatched |
| LEFT JOIN | All left rows; right side NULL if unmatched | Row explosion when right side has many matches |
| RIGHT JOIN | All right rows; left side NULL if unmatched | Rarely the right choice — rewrite as LEFT JOIN by swapping sides |
| FULL OUTER JOIN | All rows from both, NULLs where unmatched | Use when you genuinely need both sides — rare |
| CROSS JOIN | Cartesian product | Massive row counts; usually a bug unless intentional |
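The silent-data-loss pitfall in the first row is easy to demonstrate. A sketch using Python's built-in sqlite3 (schema invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, team_id INTEGER);
    CREATE TABLE teams (id INTEGER, name TEXT);
    INSERT INTO users VALUES (1, 10), (2, NULL), (3, 99);  -- team 99 doesn't exist
    INSERT INTO teams VALUES (10, 'data');
""")

# INNER JOIN: user 2 (NULL key) and user 3 (unmatched key) vanish silently
inner = conn.execute(
    "SELECT u.id FROM users u JOIN teams t ON u.team_id = t.id"
).fetchall()
print(inner)  # [(1,)]

# LEFT JOIN keeps every user; unmatched rows get NULL on the right
left = conn.execute("""
    SELECT u.id, t.name
    FROM users u LEFT JOIN teams t ON u.team_id = t.id
    ORDER BY u.id
""").fetchall()
print(left)  # [(1, 'data'), (2, None), (3, None)]
```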
The fanout problem
The most common SQL bug in analytics engineering. Joining at the wrong grain inflates your numbers:
```sql
-- Wrong: revenue per customer, but order_items duplicates orders
SELECT c.customer_id,
       SUM(o.revenue) AS total_revenue               -- inflated!
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
LEFT JOIN order_items oi ON oi.order_id = o.order_id -- adds rows per item
GROUP BY 1;

-- Right: aggregate first, then join
WITH order_totals AS (
    SELECT order_id, SUM(item_revenue) AS revenue
    FROM order_items
    GROUP BY 1
)
SELECT c.customer_id,
       SUM(ot.revenue) AS total_revenue
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
LEFT JOIN order_totals ot ON ot.order_id = o.order_id
GROUP BY 1;
```
When discussing joins, say: "the grain of this CTE is one row per order; if I join to order_items I'll fan out to one row per item, so I'd aggregate first." That sentence signals senior-level thinking.
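The inflation is easy to reproduce. A minimal sketch with Python's built-in sqlite3, using a one-customer, one-order, two-item dataset (all names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, revenue INTEGER);
    CREATE TABLE order_items (order_id INTEGER, item_revenue INTEGER);
    INSERT INTO customers VALUES (1);
    INSERT INTO orders VALUES (100, 1, 30);
    INSERT INTO order_items VALUES (100, 10), (100, 20);  -- two items, two joined rows
""")

# Wrong grain: the order_items join duplicates the order row, doubling revenue
inflated = conn.execute("""
    SELECT SUM(o.revenue) FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    LEFT JOIN order_items oi ON oi.order_id = o.order_id
    GROUP BY c.customer_id
""").fetchone()[0]
print(inflated)  # 60, not 30

# Aggregate to order grain first, then join: correct total
correct = conn.execute("""
    WITH order_totals AS (
        SELECT order_id, SUM(item_revenue) AS revenue
        FROM order_items GROUP BY order_id
    )
    SELECT SUM(ot.revenue) FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    LEFT JOIN order_totals ot ON ot.order_id = o.order_id
    GROUP BY c.customer_id
""").fetchone()[0]
print(correct)  # 30
```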
NULL handling — the gotcha that trips everyone
NULL is not a value; it's the absence of one. Three behaviors that catch candidates:
- Comparisons: `NULL = NULL` is NULL, not TRUE — use `IS NULL` / `IS NOT NULL`. Likewise `NULL <> 'x'` is NULL, so `WHERE col <> 'x'` silently excludes NULL rows.
- Arithmetic: any arithmetic with NULL is NULL. `NULL + 1 = NULL`.
- Aggregates: `SUM` skips NULLs; `AVG` skips NULLs (and divides by the non-null count); `COUNT(*)` includes them, `COUNT(col)` skips them.
Safe patterns:
```sql
-- Coalesce to a default before comparing
WHERE COALESCE(status, 'unknown') <> 'active'

-- Or be explicit
WHERE status <> 'active' OR status IS NULL

-- Null-safe comparison (Postgres / Snowflake)
WHERE col IS DISTINCT FROM other_col      -- null-safe inequality: NULL vs NULL is FALSE
WHERE col IS NOT DISTINCT FROM other_col  -- null-safe equality: NULL vs NULL is TRUE
```
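All three NULL behaviors can be verified in a scratch database. A sketch with Python's built-in sqlite3 (table and values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
scalar = lambda sql: conn.execute(sql).fetchone()[0]

print(scalar("SELECT NULL = NULL"))  # None (UNKNOWN), not TRUE
print(scalar("SELECT NULL + 1"))     # None: arithmetic with NULL is NULL

conn.execute("CREATE TABLE t (status TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("active",), ("churned",), (None,)])

# <> silently drops the NULL row: only 'churned' matches
naive = scalar("SELECT COUNT(*) FROM t WHERE status <> 'active'")
print(naive)  # 1

# COALESCE makes the intent explicit: 'churned' AND the NULL row both count
explicit = scalar(
    "SELECT COUNT(*) FROM t WHERE COALESCE(status, 'unknown') <> 'active'")
print(explicit)  # 2
```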
GROUP BY semantics
Every non-aggregated column in SELECT must appear in GROUP BY (with rare exceptions like Postgres allowing functionally-dependent columns). Patterns:
- GROUP BY 1, 2, 3 — refer to SELECT columns by position. Common in interview code; some teams ban it for clarity.
- GROUP BY ROLLUP / CUBE / GROUPING SETS — multi-level aggregation in one query.
- HAVING vs WHERE — HAVING filters after grouping; WHERE filters before. Use WHERE whenever possible (smaller rowsets to group).
CTEs and structuring complex queries
CTEs (Common Table Expressions) are the single biggest readability lever in SQL. Use them liberally.
```sql
WITH active_customers AS (
    SELECT customer_id
    FROM customers
    WHERE is_active AND created_at >= '2024-01-01'
),

recent_orders AS (
    SELECT customer_id,
           SUM(revenue) AS revenue_30d,
           COUNT(*) AS orders_30d
    FROM orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY 1
)

SELECT ac.customer_id,
       COALESCE(ro.revenue_30d, 0) AS revenue_30d,
       COALESCE(ro.orders_30d, 0) AS orders_30d
FROM active_customers ac
LEFT JOIN recent_orders ro USING (customer_id)
ORDER BY revenue_30d DESC;
```
Pattern: one CTE per concept
Name each CTE for what it represents: active_customers, monthly_revenue, first_purchase_per_customer. Reviewers should be able to read CTE names and understand the query structure without looking inside.
Performance note
In older Postgres (< 12), CTEs were "optimization fences" — the planner couldn't see through them. Modern warehouses (Snowflake, BigQuery) inline CTEs by default, and Postgres 12+ inlines single-reference, non-recursive CTEs. You can use CTEs liberally without a performance penalty in the warehouses you're likely to interview on.
Recursive CTEs
For hierarchies (org charts, category trees), use WITH RECURSIVE:
```sql
WITH RECURSIVE org_tree AS (
    -- anchor: the CEO
    SELECT employee_id, manager_id, name, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- recursive step: anyone whose manager is already in the tree
    SELECT e.employee_id, e.manager_id, e.name, ot.depth + 1
    FROM employees e
    JOIN org_tree ot ON e.manager_id = ot.employee_id
)
SELECT * FROM org_tree ORDER BY depth, name;
```
Know they exist. Know the shape (anchor + recursive step + UNION ALL). Practical use is rare but they come up.
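The anchor-plus-recursive-step shape runs as-is in SQLite, which makes it a convenient way to sanity-check before an interview. A sketch with a four-person org (data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER, manager_id INTEGER, name TEXT);
    INSERT INTO employees VALUES
        (1, NULL, 'ceo'), (2, 1, 'vp'), (3, 2, 'ic'), (4, 2, 'ic2');
""")

# anchor: the CEO; recursive step: anyone whose manager is already in the tree
depths = conn.execute("""
    WITH RECURSIVE org_tree AS (
        SELECT employee_id, manager_id, name, 1 AS depth
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.employee_id, e.manager_id, e.name, ot.depth + 1
        FROM employees e JOIN org_tree ot ON e.manager_id = ot.employee_id
    )
    SELECT name, depth FROM org_tree ORDER BY depth, name
""").fetchall()
print(depths)  # [('ceo', 1), ('vp', 2), ('ic', 3), ('ic2', 3)]
```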
Window functions — the analytics engineer's superpower
If you can't write window functions fluently, you will fail SQL interviews. They're the tool for "do something per group without collapsing rows."
The shape
```sql
function_name(...) OVER (
    PARTITION BY ...          -- like GROUP BY, but doesn't collapse rows
    ORDER BY ...              -- ordering within the partition (required for some functions)
    ROWS BETWEEN ... AND ...  -- frame: which rows the window covers
)
```
Functions you must know
| Function | What it does | Typical use |
|---|---|---|
| ROW_NUMBER() | 1, 2, 3, ... per partition | Dedup ("keep latest"), top-N per group |
| RANK() | Ties get same rank, gap follows (1, 2, 2, 4) | Leaderboards, ranking |
| DENSE_RANK() | Ties get same rank, no gap (1, 2, 2, 3) | Same as RANK but compact |
| LAG(col, n) | Value from n rows earlier in partition | Period-over-period, sessionization |
| LEAD(col, n) | Value from n rows later | Same kind of analysis, looking forward |
| FIRST_VALUE / LAST_VALUE | First / last value in window frame | First purchase, latest status |
| SUM/AVG/MIN/MAX OVER(...) | Aggregate as window | Running totals, moving averages |
| NTILE(n) | Divide partition into n buckets | Quartiles, deciles |
| PERCENT_RANK / CUME_DIST | Percentile within partition | Distribution analysis |
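The three ranking functions are easiest to internalize side by side on tied data. A sketch with Python's built-in sqlite3 (requires SQLite ≥ 3.25 for window functions; data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 90), ("b", 90), ("c", 80), ("d", 70)])

rows = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY score DESC) AS rnk,        -- 1, 1, 3, 4
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk,  -- 1, 1, 2, 3
           ROW_NUMBER() OVER (ORDER BY score DESC) AS rn          -- 1..4, tie order arbitrary
    FROM scores
    ORDER BY score DESC, name
""").fetchall()

print([r[1] for r in rows])  # [1, 1, 3, 4]  -- gap after the tie
print([r[2] for r in rows])  # [1, 1, 2, 3]  -- no gap
```

Note that ROW_NUMBER breaks the 90-90 tie arbitrarily; in a dedup query you'd add a tiebreaker column to the ORDER BY to make it deterministic.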
Canonical pattern: latest row per group
```sql
-- Get the most recent order per customer
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY ordered_at DESC
           ) AS rn
    FROM orders
)
SELECT * FROM ranked WHERE rn = 1;
```
Canonical pattern: running total
```sql
SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_revenue
FROM daily_revenue
ORDER BY order_date;
```
Canonical pattern: 7-day moving average
```sql
SELECT
    date,
    daily_value,
    AVG(daily_value) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM daily_metrics;
```
Canonical pattern: period-over-period change
```sql
SELECT
    month,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
    revenue - LAG(revenue, 1) OVER (ORDER BY month) AS mom_change,
    (revenue - LAG(revenue, 1) OVER (ORDER BY month))
        / NULLIF(LAG(revenue, 1) OVER (ORDER BY month), 0) AS mom_pct_change
FROM monthly_revenue
ORDER BY month;
```
When you use SUM() OVER (ORDER BY ...) without an explicit ROWS clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY alone and no PARTITION BY, that's a running total. Without ORDER BY, the default frame is the entire partition. Always specify the frame explicitly for clarity.
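The RANGE-vs-ROWS difference only shows up on ties, which makes it a favorite follow-up question. A sketch with Python's built-in sqlite3, where two rows share day 1 (data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (id INTEGER, day INTEGER, v INTEGER)")
conn.executemany("INSERT INTO daily VALUES (?, ?, ?)",
                 [(1, 1, 10), (2, 1, 20), (3, 2, 5)])

# Default frame with ORDER BY is RANGE ... CURRENT ROW: ties ("peers" on day)
# are all included, so both day-1 rows show the full day-1 total
range_sums = [r[0] for r in conn.execute("""
    SELECT SUM(v) OVER (ORDER BY day) FROM daily ORDER BY id
""")]
print(range_sums)  # [30, 30, 35]

# An explicit ROWS frame advances one physical row at a time
rows_sums = [r[0] for r in conn.execute("""
    SELECT SUM(v) OVER (ORDER BY day, id
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM daily ORDER BY id
""")]
print(rows_sums)  # [10, 30, 35]
```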
QUALIFY — filter on window functions
BigQuery, Snowflake, Databricks, DuckDB all support QUALIFY. It's like HAVING but for window functions. Cleans up the latest-per-group pattern:
```sql
-- Instead of wrapping in a CTE:
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY ordered_at DESC
) = 1;
```
Postgres doesn't support QUALIFY (yet). Know both forms.
Pivot & unpivot
Pivoting rows to columns is common in reports. The portable way:
```sql
-- Revenue by quarter, one row per customer
SELECT
    customer_id,
    SUM(CASE WHEN quarter = 'Q1' THEN revenue END) AS q1_revenue,
    SUM(CASE WHEN quarter = 'Q2' THEN revenue END) AS q2_revenue,
    SUM(CASE WHEN quarter = 'Q3' THEN revenue END) AS q3_revenue,
    SUM(CASE WHEN quarter = 'Q4' THEN revenue END) AS q4_revenue
FROM quarterly_revenue
GROUP BY customer_id;
```
Snowflake and BigQuery also have PIVOT / UNPIVOT operators. The CASE WHEN form works everywhere and is what most interviewers expect.
JSON / semi-structured data
Modern warehouses handle JSON natively. Syntax varies; the operations are the same:
```sql
-- Snowflake
SELECT raw:user.email::STRING AS email FROM events;

-- BigQuery
SELECT JSON_VALUE(raw, '$.user.email') AS email FROM events;

-- Postgres
SELECT raw -> 'user' ->> 'email' AS email FROM events;  -- -> keeps JSON, ->> returns text
SELECT raw #>> '{user,email}' AS email FROM events;     -- path operator

-- BigQuery: array unnesting
SELECT id, item AS product
FROM orders, UNNEST(items) AS item;

-- Snowflake: array unnesting
SELECT o.id, item.value::STRING AS product
FROM orders o, LATERAL FLATTEN(o.items) item;
```
Know how to extract scalar values, unnest arrays, and cast to strong types. Inference logs are usually JSON; you'll touch this every day at an AI infra company.
Date / time — the source of most bugs
Date arithmetic is dialect-specific and time-zone-sensitive. Patterns:
```sql
-- Snowflake / BigQuery / Postgres-ish
DATE_TRUNC('month', ts)   -- truncate to start of month
DATE_TRUNC('week', ts)    -- start of week (varies by dialect — Monday vs Sunday)
ts + INTERVAL '1 day'
ts - INTERVAL '7 days'
EXTRACT(YEAR FROM ts)
EXTRACT(DOW FROM ts)      -- day of week
DATEDIFF('day', start_ts, end_ts)
DATEADD(month, 1, ts)

-- Generate a date series (BigQuery)
SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2024-01-01', '2024-12-31')) AS dt;

-- Snowflake
SELECT DATEADD(day, seq4(), '2024-01-01') AS dt
FROM TABLE(GENERATOR(ROWCOUNT => 365));
```
Always know whether your timestamp is UTC, local, or naïve. Date-truncate after converting to the relevant timezone, or you'll get rows in the wrong day. Common bug: "why is the daily revenue chart off by ~3% between US and EU views?" — TZ mishandling.
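A minimal illustration of the day-boundary bug, using SQLite's date modifiers to stand in for timezone conversion (a real pipeline would use something like CONVERT_TIMEZONE or AT TIME ZONE; the timestamp is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An event at 02:00 UTC on Jan 1 belongs to Dec 31 in a UTC-3 reporting timezone:
# truncate to day AFTER shifting to that timezone, not before
utc_day, local_day = conn.execute("""
    SELECT date('2024-01-01 02:00:00'),
           date('2024-01-01 02:00:00', '-3 hours')
""").fetchone()

print(utc_day, local_day)  # 2024-01-01 2023-12-31
```

Every early-morning (or late-evening) event near the offset boundary lands in a different day bucket depending on when you truncate, which is exactly the ~3% daily drift described above.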
Performance — what makes warehouse queries slow
You won't be running EXPLAIN ANALYZE in a 45-minute interview, but you should know conceptually what makes queries slow:
- Scanning too much data — no partition / cluster filter. Snowflake clusters by columns you specify; BigQuery partitions by date and clusters within. Always filter on the partition key first.
- Skewed joins — one join key value dominates volume (e.g. a flood of rows with a NULL or default customer_id), so one worker does most of the work.
- Cartesian explosions — accidental CROSS JOIN from a missing ON clause.
- Spilling — query exceeds memory; data spills to disk. Often happens with window functions over huge unfiltered datasets.
- SELECT * — columnar warehouses bill you for column scans. Pulling only what you need is meaningfully cheaper.
- Repeated subqueries — running the same subquery in multiple SELECT branches. Refactor to a CTE.
- Predicate pushdown — filter early, aggregate after. Apply WHERE in the deepest CTE that can use it.
Query plan reading — the shape
Every modern warehouse shows you a plan. Look for:
- Number of rows scanned at each step — if the bottom of the plan reads billions of rows for a query that should return thousands, you're missing a filter.
- Join order and method — hash joins vs merge joins; broadcast vs shuffle in Spark/Databricks.
- Bytes processed / partitions read — BigQuery shows this directly; it's your cost.
- Spill warnings — Snowflake shows local/remote spill; Databricks shows shuffle write/read.
SQL anti-patterns to call out
- `SELECT *` in production — breaks when upstream adds a column, scans extra bytes.
- `DISTINCT` as a band-aid for duplicates — fix the grain, don't paper over it.
- `NOT IN (subquery)` when the subquery may return NULLs — returns nothing. Use `NOT EXISTS` or a `LEFT JOIN` filtered to `IS NULL`.
- Functions in WHERE that break indexes — `WHERE LOWER(email) = 'x'` can't use an index on `email`. Less of an issue in MPP warehouses (no indexes anyway) but matters in Postgres.
- Implicit type coercion — comparing string to int. May work, may not, often slow.
- Reusing SELECT aliases in WHERE — doesn't work, because WHERE runs before SELECT.
- HAVING used as a filter when WHERE would do — slower, because grouping happens first.
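The NOT IN trap is worth demonstrating to yourself once. A sketch with Python's built-in sqlite3 (tables invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER);
    CREATE TABLE banned (customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO banned VALUES (2), (NULL);  -- one NULL poisons NOT IN
""")

# id NOT IN (2, NULL): every row's comparison against NULL is UNKNOWN,
# so the whole predicate is never TRUE and zero rows come back
not_in = conn.execute("""
    SELECT id FROM customers
    WHERE id NOT IN (SELECT customer_id FROM banned)
""").fetchall()
print(not_in)  # []

# NOT EXISTS handles NULLs correctly
not_exists = conn.execute("""
    SELECT id FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM banned b WHERE b.customer_id = c.id)
    ORDER BY id
""").fetchall()
print(not_exists)  # [(1,), (3,)]
```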
Dialect differences worth knowing
You'll be asked "have you used X?" Have one-line answers ready.
| Dialect | Notable features | What's missing |
|---|---|---|
| Snowflake | QUALIFY, LATERAL FLATTEN, time travel, zero-copy clones, dynamic tables | Some standard niceties (FULL OUTER with USING) |
| BigQuery | QUALIFY, UNNEST, scripting/procedures, JS UDFs, partitions + clustering | No row-level updates without WHERE on key |
| Redshift | Sort keys, dist keys, vacuum/analyze | Less ergonomic JSON than Snowflake; older standard support |
| Databricks SQL | Delta Lake operations, MERGE INTO, photon engine, table constraints | QUALIFY is recent; some warehouse niceties newer |
| Postgres | Best standard compliance, JSON ops, extensions (pgvector, etc.) | No QUALIFY, slower at warehouse scale |
| DuckDB | OLAP-on-laptop, fast, modern SQL, QUALIFY, list/array types | Single-node (mostly) |
| ClickHouse | Extreme scan speed for analytics, materialized views, dictionaries | Quirky SQL dialect, weak join performance historically |
How to approach a SQL live coding round
The framework that makes you look senior:
- Restate the schema. "So we have a `customers` table at one row per customer, and an `orders` table at one row per order. Is there an `order_items` table too?"
- Clarify grain & constraints. "What's the grain of the output? One row per customer per month?" "Are timestamps UTC?" "Can a customer be deleted?"
- Walk through a tiny example before coding. "For customer A with orders on Jan 1 and Jan 15, the answer would be 2. Right?"
- State the approach. "I'll filter to active customers, then aggregate orders by month per customer, then join."
- Code in CTEs. One concept per CTE. Name them well. Resist the urge to write one giant nested query.
- Talk while you type. Silence reads as stuck. Narration reads as competent.
- Test against the example. Walk your own code line by line against the input.
- Mention edge cases. NULLs, ties in window functions, the date-boundary issue. Even if you don't fix them, calling them out signals seniority.
- Discuss performance. "At scale I'd add a filter on partition_date here. And I'd worry about the grain after this join."
"What's the grain of this CTE?" Say it out loud at every major step. It's the senior data engineer's tic and interviewers love hearing it.