SQL Deep Dive
CTEs, window functions, query plans, anti-patterns. The single most-tested skill in analytics-engineering loops — fluency under pressure separates strong candidates from the rest.
SQL is graded on fluency, not cleverness. You should write correct CTEs and window functions without hesitating. Plan to write SQL out loud on a timer — record yourself if you're alone — until each pattern feels automatic. Then drill 11-sql-problems.
Logical query order — know this cold
SQL is written one way and executed another. The logical order:
1. FROM (and JOINs)
2. WHERE (filters rows BEFORE grouping)
3. GROUP BY
4. HAVING (filters groups AFTER grouping)
5. SELECT (expressions evaluated here — that's why aliases aren't available in WHERE)
6. DISTINCT
7. ORDER BY (aliases ARE available here)
8. LIMIT / OFFSET
This explains common errors interviewers love to catch:
- "Why can't I use a SELECT alias in WHERE?" — because WHERE runs before SELECT.
- "Why is COUNT(*) including NULL?" — because COUNT(*) counts rows, not values. COUNT(col) skips NULLs.
- "Why does HAVING work where WHERE didn't?" — because HAVING runs after grouping; WHERE runs before.
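All three answers can be checked in a scratch database. A quick sketch using Python's built-in sqlite3 (table and data invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("a", 10), ("a", 20), ("b", 5), ("b", None)])

# COUNT(*) counts rows; COUNT(col) skips NULLs
counts = conn.execute("SELECT COUNT(*), COUNT(revenue) FROM orders").fetchone()
print(counts)  # (4, 3)

# WHERE filters rows BEFORE grouping: only the single 20-revenue row survives
where_result = conn.execute("""
    SELECT customer, SUM(revenue) FROM orders
    WHERE revenue >= 20
    GROUP BY customer
""").fetchall()
print(where_result)  # [('a', 20)]

# HAVING filters groups AFTER grouping: a's total (30) passes, b's (5) doesn't
having_result = conn.execute("""
    SELECT customer, SUM(revenue) FROM orders
    GROUP BY customer
    HAVING SUM(revenue) >= 20
""").fetchall()
print(having_result)  # [('a', 30)]
```

Same threshold, different meaning: WHERE compares individual rows, HAVING compares group totals.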
Joins — really know them
Five join types you should be able to draw on a whiteboard and explain in one sentence:
| Join | What it returns | Common pitfall |
|---|---|---|
| INNER JOIN | Only rows that match on both sides | Silent data loss when keys are NULL or unmatched |
| LEFT JOIN | All left rows; right side NULL if unmatched | Row explosion when right side has many matches |
| RIGHT JOIN | All right rows; left side NULL if unmatched | Rarely the right choice — rewrite as LEFT JOIN by swapping sides |
| FULL OUTER JOIN | All rows from both, NULLs where unmatched | Use when you genuinely need both sides — rare |
| CROSS JOIN | Cartesian product | Massive row counts; usually a bug unless intentional |
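The silent-data-loss pitfall in the first row is easy to demonstrate. A sketch using Python's built-in sqlite3 (schema invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, team_id INTEGER);
    CREATE TABLE teams (id INTEGER, name TEXT);
    INSERT INTO users VALUES (1, 10), (2, NULL), (3, 99);  -- team 99 doesn't exist
    INSERT INTO teams VALUES (10, 'data');
""")

# INNER JOIN: user 2 (NULL key) and user 3 (unmatched key) vanish silently
inner = conn.execute(
    "SELECT u.id FROM users u JOIN teams t ON u.team_id = t.id"
).fetchall()
print(inner)  # [(1,)]

# LEFT JOIN keeps every user; unmatched rows get NULL on the right
left = conn.execute("""
    SELECT u.id, t.name
    FROM users u LEFT JOIN teams t ON u.team_id = t.id
    ORDER BY u.id
""").fetchall()
print(left)  # [(1, 'data'), (2, None), (3, None)]
```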
The fanout problem
The most common SQL bug in analytics engineering. Joining at the wrong grain inflates your numbers:
```sql
-- Wrong: revenue per customer, but order_items duplicates orders
SELECT c.customer_id,
       SUM(o.revenue) AS total_revenue               -- inflated!
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
LEFT JOIN order_items oi ON oi.order_id = o.order_id -- adds rows per item
GROUP BY 1;

-- Right: aggregate first, then join
WITH order_totals AS (
    SELECT order_id, SUM(item_revenue) AS revenue
    FROM order_items
    GROUP BY 1
)
SELECT c.customer_id,
       SUM(ot.revenue) AS total_revenue
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
LEFT JOIN order_totals ot ON ot.order_id = o.order_id
GROUP BY 1;
```
When discussing joins, say: "the grain of this CTE is one row per order; if I join to order_items I'll fan out to one row per item, so I'd aggregate first." That sentence signals senior-level thinking.
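The inflation is easy to reproduce. A minimal sketch with Python's built-in sqlite3, using a one-customer, one-order, two-item dataset (all names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, revenue INTEGER);
    CREATE TABLE order_items (order_id INTEGER, item_revenue INTEGER);
    INSERT INTO customers VALUES (1);
    INSERT INTO orders VALUES (100, 1, 30);
    INSERT INTO order_items VALUES (100, 10), (100, 20);  -- two items, two joined rows
""")

# Wrong grain: the order_items join duplicates the order row, doubling revenue
inflated = conn.execute("""
    SELECT SUM(o.revenue) FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    LEFT JOIN order_items oi ON oi.order_id = o.order_id
    GROUP BY c.customer_id
""").fetchone()[0]
print(inflated)  # 60, not 30

# Aggregate to order grain first, then join: correct total
correct = conn.execute("""
    WITH order_totals AS (
        SELECT order_id, SUM(item_revenue) AS revenue
        FROM order_items GROUP BY order_id
    )
    SELECT SUM(ot.revenue) FROM customers c
    LEFT JOIN orders o ON o.customer_id = c.customer_id
    LEFT JOIN order_totals ot ON ot.order_id = o.order_id
    GROUP BY c.customer_id
""").fetchone()[0]
print(correct)  # 30
```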
NULL handling — the gotcha that trips everyone
NULL is not a value; it's the absence of one. Three behaviors that catch candidates:
- Comparisons: `NULL = NULL` is NULL, not TRUE — use `IS NULL` / `IS NOT NULL`. Likewise `NULL <> 'x'` is NULL, so `WHERE col <> 'x'` silently excludes NULL rows.
- Arithmetic: any arithmetic with NULL is NULL. `NULL + 1 = NULL`.
- Aggregates: `SUM` skips NULLs; `AVG` skips NULLs (and divides by the non-null count); `COUNT(*)` includes them, `COUNT(col)` skips them.
Safe patterns:
```sql
-- Coalesce to a default before comparing
WHERE COALESCE(status, 'unknown') <> 'active'

-- Or be explicit
WHERE status <> 'active' OR status IS NULL

-- Null-safe comparison (Postgres / Snowflake)
WHERE col IS DISTINCT FROM other_col      -- null-safe inequality: NULL vs NULL is FALSE
WHERE col IS NOT DISTINCT FROM other_col  -- null-safe equality: NULL vs NULL is TRUE
```
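All three NULL behaviors can be verified in a scratch database. A sketch with Python's built-in sqlite3 (table and values invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
scalar = lambda sql: conn.execute(sql).fetchone()[0]

print(scalar("SELECT NULL = NULL"))  # None (UNKNOWN), not TRUE
print(scalar("SELECT NULL + 1"))     # None: arithmetic with NULL is NULL

conn.execute("CREATE TABLE t (status TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("active",), ("churned",), (None,)])

# <> silently drops the NULL row: only 'churned' matches
naive = scalar("SELECT COUNT(*) FROM t WHERE status <> 'active'")
print(naive)  # 1

# COALESCE makes the intent explicit: 'churned' AND the NULL row both count
explicit = scalar(
    "SELECT COUNT(*) FROM t WHERE COALESCE(status, 'unknown') <> 'active'")
print(explicit)  # 2
```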
GROUP BY semantics
Every non-aggregated column in SELECT must appear in GROUP BY (with rare exceptions like Postgres allowing functionally-dependent columns). Patterns:
- GROUP BY 1, 2, 3 — refer to SELECT columns by position. Common in interview code; some teams ban it for clarity.
- GROUP BY ROLLUP / CUBE / GROUPING SETS — multi-level aggregation in one query.
- HAVING vs WHERE — HAVING filters after grouping; WHERE filters before. Use WHERE whenever possible (smaller rowsets to group).
CTEs and structuring complex queries
CTEs (Common Table Expressions) are the single biggest readability lever in SQL. Use them liberally.
```sql
WITH active_customers AS (
    SELECT customer_id
    FROM customers
    WHERE is_active AND created_at >= '2024-01-01'
),

recent_orders AS (
    SELECT customer_id,
           SUM(revenue) AS revenue_30d,
           COUNT(*) AS orders_30d
    FROM orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY 1
)

SELECT ac.customer_id,
       COALESCE(ro.revenue_30d, 0) AS revenue_30d,
       COALESCE(ro.orders_30d, 0) AS orders_30d
FROM active_customers ac
LEFT JOIN recent_orders ro USING (customer_id)
ORDER BY revenue_30d DESC;
```
Pattern: one CTE per concept
Name each CTE for what it represents: active_customers, monthly_revenue, first_purchase_per_customer. Reviewers should be able to read CTE names and understand the query structure without looking inside.
Performance note
In older Postgres (< 12), CTEs were "optimization fences" — the planner couldn't see through them. Modern warehouses (Snowflake, BigQuery) inline CTEs by default, and Postgres 12+ inlines single-reference, non-recursive CTEs. You can use CTEs liberally without a performance penalty in the warehouses you're likely to interview on.
Recursive CTEs
For hierarchies (org charts, category trees), use WITH RECURSIVE:
```sql
WITH RECURSIVE org_tree AS (
    -- anchor: the CEO
    SELECT employee_id, manager_id, name, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- recursive step: anyone whose manager is already in the tree
    SELECT e.employee_id, e.manager_id, e.name, ot.depth + 1
    FROM employees e
    JOIN org_tree ot ON e.manager_id = ot.employee_id
)
SELECT * FROM org_tree ORDER BY depth, name;
```
Know they exist. Know the shape (anchor + recursive step + UNION ALL). Practical use is rare but they come up.
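The anchor-plus-recursive-step shape runs as-is in SQLite, which makes it a convenient way to sanity-check before an interview. A sketch with a four-person org (data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (employee_id INTEGER, manager_id INTEGER, name TEXT);
    INSERT INTO employees VALUES
        (1, NULL, 'ceo'), (2, 1, 'vp'), (3, 2, 'ic'), (4, 2, 'ic2');
""")

# anchor: the CEO; recursive step: anyone whose manager is already in the tree
depths = conn.execute("""
    WITH RECURSIVE org_tree AS (
        SELECT employee_id, manager_id, name, 1 AS depth
        FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.employee_id, e.manager_id, e.name, ot.depth + 1
        FROM employees e JOIN org_tree ot ON e.manager_id = ot.employee_id
    )
    SELECT name, depth FROM org_tree ORDER BY depth, name
""").fetchall()
print(depths)  # [('ceo', 1), ('vp', 2), ('ic', 3), ('ic2', 3)]
```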
Window functions — the analytics engineer's superpower
If you can't write window functions fluently, you will fail SQL interviews. They're the tool for "do something per group without collapsing rows."
The shape
```sql
function_name(...) OVER (
    PARTITION BY ...          -- like GROUP BY, but doesn't collapse rows
    ORDER BY ...              -- ordering within the partition (required for some functions)
    ROWS BETWEEN ... AND ...  -- frame: which rows the window covers
)
```
Functions you must know
| Function | What it does | Typical use |
|---|---|---|
| ROW_NUMBER() | 1, 2, 3, ... per partition | Dedup ("keep latest"), top-N per group |
| RANK() | Ties get same rank, gap follows (1, 2, 2, 4) | Leaderboards, ranking |
| DENSE_RANK() | Ties get same rank, no gap (1, 2, 2, 3) | Same as RANK but compact |
| LAG(col, n) | Value from n rows earlier in partition | Period-over-period, sessionization |
| LEAD(col, n) | Value from n rows later | Same kind of analysis, looking forward |
| FIRST_VALUE / LAST_VALUE | First / last value in window frame | First purchase, latest status |
| SUM/AVG/MIN/MAX OVER(...) | Aggregate as window | Running totals, moving averages |
| NTILE(n) | Divide partition into n buckets | Quartiles, deciles |
| PERCENT_RANK / CUME_DIST | Percentile within partition | Distribution analysis |
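The three ranking functions are easiest to internalize side by side on tied data. A sketch with Python's built-in sqlite3 (requires SQLite ≥ 3.25 for window functions; data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 90), ("b", 90), ("c", 80), ("d", 70)])

rows = conn.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY score DESC) AS rnk,        -- 1, 1, 3, 4
           DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk,  -- 1, 1, 2, 3
           ROW_NUMBER() OVER (ORDER BY score DESC) AS rn          -- 1..4, tie order arbitrary
    FROM scores
    ORDER BY score DESC, name
""").fetchall()

print([r[1] for r in rows])  # [1, 1, 3, 4]  -- gap after the tie
print([r[2] for r in rows])  # [1, 1, 2, 3]  -- no gap
```

Note that ROW_NUMBER breaks the 90-90 tie arbitrarily; in a dedup query you'd add a tiebreaker column to the ORDER BY to make it deterministic.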
Canonical pattern: latest row per group
```sql
-- Get the most recent order per customer
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY ordered_at DESC
           ) AS rn
    FROM orders
)
SELECT * FROM ranked WHERE rn = 1;
```
Canonical pattern: running total
```sql
SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_revenue
FROM daily_revenue
ORDER BY order_date;
```
Canonical pattern: 7-day moving average
```sql
SELECT
    date,
    daily_value,
    AVG(daily_value) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM daily_metrics;
```
Canonical pattern: period-over-period change
```sql
SELECT
    month,
    revenue,
    LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
    revenue - LAG(revenue, 1) OVER (ORDER BY month) AS mom_change,
    (revenue - LAG(revenue, 1) OVER (ORDER BY month))
        / NULLIF(LAG(revenue, 1) OVER (ORDER BY month), 0) AS mom_pct_change
FROM monthly_revenue
ORDER BY month;
```
When you use SUM() OVER (ORDER BY ...) without an explicit ROWS clause, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY alone and no PARTITION BY, that's a running total. Without ORDER BY, the default frame is the entire partition. Always specify the frame explicitly for clarity.
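The RANGE-vs-ROWS difference only shows up on ties, which makes it a favorite follow-up question. A sketch with Python's built-in sqlite3, where two rows share day 1 (data invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (id INTEGER, day INTEGER, v INTEGER)")
conn.executemany("INSERT INTO daily VALUES (?, ?, ?)",
                 [(1, 1, 10), (2, 1, 20), (3, 2, 5)])

# Default frame with ORDER BY is RANGE ... CURRENT ROW: ties ("peers" on day)
# are all included, so both day-1 rows show the full day-1 total
range_sums = [r[0] for r in conn.execute("""
    SELECT SUM(v) OVER (ORDER BY day) FROM daily ORDER BY id
""")]
print(range_sums)  # [30, 30, 35]

# An explicit ROWS frame advances one physical row at a time
rows_sums = [r[0] for r in conn.execute("""
    SELECT SUM(v) OVER (ORDER BY day, id
                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
    FROM daily ORDER BY id
""")]
print(rows_sums)  # [10, 30, 35]
```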
QUALIFY — filter on window functions
BigQuery, Snowflake, Databricks, DuckDB all support QUALIFY. It's like HAVING but for window functions. Cleans up the latest-per-group pattern:
```sql
-- Instead of wrapping in a CTE:
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY ordered_at DESC
) = 1;
```
Postgres doesn't support QUALIFY (yet). Know both forms.
Pivot & unpivot
Pivoting rows to columns is common in reports. The portable way:
```sql
-- Revenue by quarter, one row per customer
SELECT
    customer_id,
    SUM(CASE WHEN quarter = 'Q1' THEN revenue END) AS q1_revenue,
    SUM(CASE WHEN quarter = 'Q2' THEN revenue END) AS q2_revenue,
    SUM(CASE WHEN quarter = 'Q3' THEN revenue END) AS q3_revenue,
    SUM(CASE WHEN quarter = 'Q4' THEN revenue END) AS q4_revenue
FROM quarterly_revenue
GROUP BY customer_id;
```
Snowflake and BigQuery also have PIVOT / UNPIVOT operators. The CASE WHEN form works everywhere and is what most interviewers expect.
JSON / semi-structured data
Modern warehouses handle JSON natively. Syntax varies; the operations are the same:
```sql
-- Snowflake
SELECT raw:user.email::STRING AS email FROM events;

-- BigQuery
SELECT JSON_VALUE(raw, '$.user.email') AS email FROM events;

-- Postgres
SELECT raw -> 'user' ->> 'email' AS email FROM events;  -- -> keeps JSON, ->> returns text
SELECT raw #>> '{user,email}' AS email FROM events;     -- path operator

-- BigQuery: array unnesting
SELECT id, item AS product
FROM orders, UNNEST(items) AS item;

-- Snowflake: array unnesting
SELECT o.id, item.value::STRING AS product
FROM orders o, LATERAL FLATTEN(o.items) item;
```
Know how to extract scalar values, unnest arrays, and cast to strong types. Inference logs are usually JSON; you'll touch this every day at an AI infra company.
Date / time — the source of most bugs
Date arithmetic is dialect-specific and time-zone-sensitive. Patterns:
```sql
-- Snowflake / BigQuery / Postgres-ish
DATE_TRUNC('month', ts)   -- truncate to start of month
DATE_TRUNC('week', ts)    -- start of week (varies by dialect — Monday vs Sunday)
ts + INTERVAL '1 day'
ts - INTERVAL '7 days'
EXTRACT(YEAR FROM ts)
EXTRACT(DOW FROM ts)      -- day of week
DATEDIFF('day', start_ts, end_ts)
DATEADD(month, 1, ts)

-- Generate a date series (BigQuery)
SELECT * FROM UNNEST(GENERATE_DATE_ARRAY('2024-01-01', '2024-12-31')) AS dt;

-- Snowflake
SELECT DATEADD(day, seq4(), '2024-01-01') AS dt
FROM TABLE(GENERATOR(ROWCOUNT => 365));
```
Always know whether your timestamp is UTC, local, or naïve. Date-truncate after converting to the relevant timezone, or you'll get rows in the wrong day. Common bug: "why is the daily revenue chart off by ~3% between US and EU views?" — TZ mishandling.
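A minimal illustration of the day-boundary bug, using SQLite's date modifiers to stand in for timezone conversion (a real pipeline would use something like CONVERT_TIMEZONE or AT TIME ZONE; the timestamp is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# An event at 02:00 UTC on Jan 1 belongs to Dec 31 in a UTC-3 reporting timezone:
# truncate to day AFTER shifting to that timezone, not before
utc_day, local_day = conn.execute("""
    SELECT date('2024-01-01 02:00:00'),
           date('2024-01-01 02:00:00', '-3 hours')
""").fetchone()

print(utc_day, local_day)  # 2024-01-01 2023-12-31
```

Every early-morning (or late-evening) event near the offset boundary lands in a different day bucket depending on when you truncate, which is exactly the ~3% daily drift described above.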
Performance — what makes warehouse queries slow
You won't be running EXPLAIN ANALYZE in a 45-minute interview, but you should know conceptually what makes queries slow:
- Scanning too much data — no partition / cluster filter. Snowflake clusters by columns you specify; BigQuery partitions by date and clusters within. Always filter on the partition key first.
- Skewed joins — one join key value dominates volume (e.g. a flood of rows with a NULL or default customer_id), so one worker does most of the work.
- Cartesian explosions — accidental CROSS JOIN from a missing ON clause.
- Spilling — query exceeds memory; data spills to disk. Often happens with window functions over huge unfiltered datasets.
- SELECT * — columnar warehouses bill you for column scans. Pulling only what you need is meaningfully cheaper.
- Repeated subqueries — running the same subquery in multiple SELECT branches. Refactor to a CTE.
- Predicate pushdown — filter early, aggregate after. Apply WHERE in the deepest CTE that can use it.
Query plan reading — the shape
Every modern warehouse shows you a plan. Look for:
- Number of rows scanned at each step — if the bottom of the plan reads billions of rows for a query that should return thousands, you're missing a filter.
- Join order and method — hash joins vs merge joins; broadcast vs shuffle in Spark/Databricks.
- Bytes processed / partitions read — BigQuery shows this directly; it's your cost.
- Spill warnings — Snowflake shows local/remote spill; Databricks shows shuffle write/read.
SQL anti-patterns to call out
- `SELECT *` in production — breaks when upstream adds a column, scans extra bytes.
- `DISTINCT` as a band-aid for duplicates — fix the grain, don't paper over it.
- `NOT IN (subquery)` when the subquery may return NULLs — returns nothing. Use `NOT EXISTS` or a `LEFT JOIN` filtered to `IS NULL`.
- Functions in WHERE that break indexes — `WHERE LOWER(email) = 'x'` can't use an index on `email`. Less of an issue in MPP warehouses (no indexes anyway) but matters in Postgres.
- Implicit type coercion — comparing string to int. May work, may not, often slow.
- Reusing SELECT aliases in WHERE — doesn't work, because WHERE runs before SELECT.
- HAVING used as a filter when WHERE would do — slower, because grouping happens first.
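The NOT IN trap is worth demonstrating to yourself once. A sketch with Python's built-in sqlite3 (tables invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER);
    CREATE TABLE banned (customer_id INTEGER);
    INSERT INTO customers VALUES (1), (2), (3);
    INSERT INTO banned VALUES (2), (NULL);  -- one NULL poisons NOT IN
""")

# id NOT IN (2, NULL): every row's comparison against NULL is UNKNOWN,
# so the whole predicate is never TRUE and zero rows come back
not_in = conn.execute("""
    SELECT id FROM customers
    WHERE id NOT IN (SELECT customer_id FROM banned)
""").fetchall()
print(not_in)  # []

# NOT EXISTS handles NULLs correctly
not_exists = conn.execute("""
    SELECT id FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM banned b WHERE b.customer_id = c.id)
    ORDER BY id
""").fetchall()
print(not_exists)  # [(1,), (3,)]
```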
Dialect differences worth knowing
You'll be asked "have you used X?" Have one-line answers ready.
| Dialect | Notable features | What's missing |
|---|---|---|
| Snowflake | QUALIFY, LATERAL FLATTEN, time travel, zero-copy clones, dynamic tables | Some standard niceties (FULL OUTER with USING) |
| BigQuery | QUALIFY, UNNEST, scripting/procedures, JS UDFs, partitions + clustering | No row-level updates without WHERE on key |
| Redshift | Sort keys, dist keys, vacuum/analyze | Less ergonomic JSON than Snowflake; older standard support |
| Databricks SQL | Delta Lake operations, MERGE INTO, photon engine, table constraints | QUALIFY is recent; some warehouse niceties newer |
| Postgres | Best standard compliance, JSON ops, extensions (pgvector, etc.) | No QUALIFY, slower at warehouse scale |
| DuckDB | OLAP-on-laptop, fast, modern SQL, QUALIFY, list/array types | Single-node (mostly) |
| ClickHouse | Extreme scan speed for analytics, materialized views, dictionaries | Quirky SQL dialect, weak join performance historically |
How to approach a SQL live coding round
The framework that makes you look senior:
- Restate the schema. "So we have a `customers` table at one row per customer, and an `orders` table at one row per order. Is there an `order_items` table too?"
- Clarify grain & constraints. "What's the grain of the output? One row per customer per month?" "Are timestamps UTC?" "Can a customer be deleted?"
- Walk through a tiny example before coding. "For customer A with orders on Jan 1 and Jan 15, the answer would be 2. Right?"
- State the approach. "I'll filter to active customers, then aggregate orders by month per customer, then join."
- Code in CTEs. One concept per CTE. Name them well. Resist the urge to write one giant nested query.
- Talk while you type. Silence reads as stuck. Narration reads as competent.
- Test against the example. Walk your own code line by line against the input.
- Mention edge cases. NULLs, ties in window functions, the date-boundary issue. Even if you don't fix them, calling them out signals seniority.
- Discuss performance. "At scale I'd add a filter on partition_date here. And I'd worry about the grain after this join."
"What's the grain of this CTE?" Say it out loud at every major step. It's the senior data engineer's tic and interviewers love hearing it.