dbt Deep Dive
Models, sources, tests, snapshots, macros, incremental loads. The defining tool of analytics engineering — fluency here separates senior from mid-level.
What dbt actually is
dbt (data build tool) is — at its core — a SQL compiler with conventions. You write SELECT statements; dbt wraps them in CREATE TABLE / CREATE VIEW / INCREMENTAL INSERT logic and runs them against your warehouse in the right order.
The pieces that make it more than that:
- Dependency graph — uses `{{ ref('model_name') }}` to determine build order. The DAG is the entire reason dbt works.
- Materialization strategies — view, table, incremental, ephemeral, snapshot. Same SQL, different runtime behavior.
- Tests — first-class SQL-based assertions on data. Run alongside builds.
- Jinja templating — Python-like macros over SQL.
- Documentation as code — yaml descriptions + auto-generated docs site with lineage graph.
- Environments + targets — same code, different warehouses/schemas (dev / staging / prod).
"dbt is the T in ELT — it takes raw loaded data in the warehouse and turns it into analytics-ready, tested, documented models, using version-controlled SQL and a dependency graph."
Project layout — know it
my_dbt_project/
├── dbt_project.yml # project config — name, profile, paths, configs
├── packages.yml # external dbt packages
├── models/
│ ├── staging/ # 1:1 with source tables, light cleanup
│ │ ├── _sources.yml # declares raw schemas
│ │ ├── stg_users.sql
│ │ └── stg_orders.sql
│ ├── intermediate/ # complex joins, reusable building blocks
│ │ └── int_user_lifetime_value.sql
│ └── marts/ # business-facing tables (fct_, dim_)
│ ├── core/
│ │ ├── fct_orders.sql
│ │ └── dim_users.sql
│ └── finance/
│ └── fct_daily_revenue.sql
├── tests/ # singular (custom) tests
├── macros/ # reusable Jinja macros
├── snapshots/ # SCD type 2 tracking
├── seeds/ # CSVs that get loaded as tables
└── analyses/ # one-off SQL files (NOT materialized)
The staging → intermediate → marts pattern
- Staging (`stg_*`): one model per source table. Rename columns, cast types, light filtering. No business logic.
- Intermediate (`int_*`): complex joins and computations that get reused. Optional layer.
- Marts (`fct_*`, `dim_*`): the analytics-facing layer. Star schema tables. This is what BI tools query.
This convention comes from dbt Labs' style guide. Most companies follow it. If asked "how do you organize a dbt project?" — name these three layers.
Models & materializations
A model is just a SELECT in a .sql file. dbt wraps it based on the materialization config:
| Materialization | What dbt does | When to use |
|---|---|---|
| `view` | CREATE OR REPLACE VIEW | Cheap, always fresh; OK for small datasets or staging layer |
| `table` | CREATE OR REPLACE TABLE AS SELECT | Full rebuild every run; default for most marts |
| `incremental` | Initial CREATE; subsequent runs MERGE/INSERT new rows only | Large fact tables where full rebuild is too slow/expensive |
| `ephemeral` | Inlined as a CTE in downstream models — no table created | Reusable transformations you don't want to materialize |
| `materialized_view` / `dynamic_table` | Warehouse-native incremental refresh (Snowflake Dynamic Tables, BigQuery Materialized Views) | Near-real-time freshness with low operational overhead |
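As a rough sketch of what a materialization config translates to — the exact DDL is adapter-specific, and the relation names here are illustrative — a `table` materialization compiles to something like:

```sql
-- Roughly what dbt runs for materialized='table' (adapter-specific sketch):
-- your model's SELECT, with ref()/source() resolved to concrete relations
CREATE OR REPLACE TABLE analytics.fct_orders AS (
    SELECT order_id, user_id, amount
    FROM analytics.stg_orders
);
```

A `view` swaps the DDL for CREATE OR REPLACE VIEW; `incremental` runs this only on the first build and a MERGE/INSERT afterwards.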
Config — three places to set it
-- 1. In the model file (highest precedence for that model):
{{ config(materialized='incremental', unique_key='order_id', on_schema_change='append_new_columns') }}
SELECT ...
-- 2. In a yaml file alongside the model:
-- models/marts/_marts.yml
models:
- name: fct_orders
config:
materialized: incremental
-- 3. In dbt_project.yml (lowest precedence — folder-level defaults):
models:
my_project:
staging:
+materialized: view
marts:
+materialized: table
Most teams set folder-level defaults in dbt_project.yml (staging = view, marts = table) and override at the model level when needed.
Sources — declare your raw data
Sources declare which raw tables in your warehouse dbt depends on. Two reasons they matter:
- Lineage — dbt's DAG starts at source tables.
- Freshness checks — dbt can run `dbt source freshness` to alert when source data is stale.
version: 2
sources:
- name: raw_app
database: raw
schema: app_prod
loaded_at_field: _ingested_at
freshness:
warn_after: {count: 2, period: hour}
error_after: {count: 6, period: hour}
tables:
- name: users
description: "User accounts from the app database"
columns:
- name: id
tests:
- unique
- not_null
- name: orders
loaded_at_field: created_at # override per-table
SELECT
id AS user_id,
email,
created_at,
updated_at,
is_active
FROM {{ source('raw_app', 'users') }}
ref() and source() — the magic
These are the two Jinja functions that make dbt's DAG work:
- `{{ source('source_name', 'table_name') }}` — references a raw source table. The first argument is the source name declared in yaml, not necessarily the schema.
- `{{ ref('model_name') }}` — references another dbt model. dbt builds it first, then yours.
Never hard-code a table name. Always use ref(). This:
- Builds the dependency graph.
- Handles environment-specific schemas automatically (dev vs prod).
- Lets dbt resolve cross-database / cross-project references.
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
),
order_items AS (
SELECT * FROM {{ ref('stg_order_items') }}
),
users AS (
SELECT * FROM {{ ref('stg_users') }}
)
SELECT ...
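The DAG mechanics are simple enough to sketch in a few lines. This is not dbt's implementation — just a minimal stdlib illustration of the idea: parse each model's SQL for ref() calls, build a dependency graph, topologically sort it to get run order. Model names and SQL are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical mini-project: model name -> raw SQL containing ref() calls.
models = {
    "stg_orders": "SELECT * FROM raw.orders",
    "stg_users": "SELECT * FROM raw.users",
    "fct_orders": (
        "SELECT * FROM {{ ref('stg_orders') }} "
        "JOIN {{ ref('stg_users') }} USING (user_id)"
    ),
}

REF = re.compile(r"\{\{\s*ref\('([^']+)'\)\s*\}\}")

def build_order(models):
    # Each model's dependencies are whatever it ref()s; a topological
    # sort of that graph is the build order -- the same idea dbt uses.
    graph = {name: set(REF.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())

order = build_order(models)
# Staging models always come before the mart that refs them.
assert order.index("stg_orders") < order.index("fct_orders")
assert order.index("stg_users") < order.index("fct_orders")
```

This is also why hard-coded table names are poison: a reference dbt can't parse is a dependency the DAG doesn't know about.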
Tests — first-class data quality
dbt has two kinds of tests:
1. Generic tests (declarative, in yaml)
Built-in: `unique`, `not_null`, `accepted_values`, `relationships`.
version: 2
models:
- name: dim_users
description: "One row per user"
columns:
- name: user_id
description: "Primary key"
tests:
- unique
- not_null
- name: status
tests:
- accepted_values:
values: ['active', 'inactive', 'churned']
- name: plan_id
tests:
- relationships:
to: ref('dim_plans')
field: plan_id
2. Singular tests (custom SQL in tests/)
A singular test is just a SQL file in tests/ that returns the failing rows. Empty result = test passes.
SELECT *
FROM {{ ref('fct_orders') }}
WHERE revenue < 0
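If a singular test's logic recurs across models, it can be promoted to a custom generic test — a macro defined with dbt's `{% test %}` block, then applied in yaml like any built-in. The test name here is illustrative:

```sql
-- macros/test_positive_values.sql
-- A custom generic test: fails on any row where the column is negative.
{% test positive_values(model, column_name) %}

SELECT *
FROM {{ model }}
WHERE {{ column_name }} < 0

{% endtest %}
```

Once defined, `- positive_values` can be listed under any column's tests in yaml, the same way `- not_null` is.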
Powerful packages
dbt-utils and dbt-expectations add dozens of tests: expression_is_true, recency, at_least_one, distribution tests, etc. Mention these in interviews to show ecosystem awareness.
Tests can be configured with severity: warn (vs default error). Warn-level tests show in logs but don't fail the run. Useful for "I want to know about this" without paging the on-call. Mention this pattern.
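In yaml, the warn-level pattern looks like this (model, column, and test choice are illustrative):

```yaml
models:
  - name: fct_orders
    columns:
      - name: discount_pct
        tests:
          - not_null:
              config:
                severity: warn   # surfaces in logs, doesn't fail the run
```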
Snapshots — SCD Type 2 for free
A snapshot tracks how a row's columns change over time. Each version gets valid_from and valid_to columns. This is the canonical way to implement Slowly Changing Dimension Type 2 in dbt.
{% snapshot users_snapshot %}
{{
config(
target_schema='snapshots',
unique_key='user_id',
strategy='check',
check_cols=['plan_id', 'status', 'email']
)
}}
SELECT * FROM {{ source('raw_app', 'users') }}
{% endsnapshot %}
Strategies:
- `check`: dbt compares the columns you list; a new row is written when any of them changes.
- `timestamp`: dbt watches an `updated_at` column. More efficient when available.
Run with dbt snapshot on a schedule (usually nightly).
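Querying a snapshot follows from the validity columns dbt adds (`dbt_valid_from`, `dbt_valid_to`); the schema and key below match the example above, the date is illustrative:

```sql
-- Current version of each user: the open-ended validity window.
SELECT *
FROM snapshots.users_snapshot
WHERE dbt_valid_to IS NULL;

-- Point-in-time: what did user 42 look like on 2024-06-01?
SELECT *
FROM snapshots.users_snapshot
WHERE user_id = 42
  AND dbt_valid_from <= '2024-06-01'
  AND (dbt_valid_to > '2024-06-01' OR dbt_valid_to IS NULL);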
Seeds — small CSVs as tables
Drop a CSV into seeds/, run dbt seed, get a table. Good for: country code mappings, manually-curated business rules, fixture data for tests. Not for: anything large or that changes often.
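A minimal seed setup, assuming a hypothetical `country_codes.csv` in seeds/ — column types can be pinned in dbt_project.yml so the CSV isn't mis-inferred:

```yaml
seeds:
  my_project:
    country_codes:
      +column_types:
        country_code: varchar(2)
        country_name: varchar(100)
```

After `dbt seed`, the table is referenced like any model: `{{ ref('country_codes') }}`.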
Macros & Jinja
Macros are reusable Jinja functions over SQL. Useful for DRYing up repeated logic.
{% macro cents_to_dollars(column_name, precision=2) %}
ROUND(({{ column_name }} / 100.0), {{ precision }})
{% endmacro %}
SELECT
order_id,
{{ cents_to_dollars('amount_cents') }} AS amount_usd
FROM {{ ref('stg_orders') }}
Common Jinja patterns
- `{% if target.name == 'dev' %} ... {% endif %}` — environment-conditional SQL.
- `{% for col in adapter.get_columns_in_relation(ref('x')) %}` — iterate over columns at compile time.
- `{{ var('start_date') }}` — parameterize via project variables.
- `{{ env_var('SOMETHING') }}` — read from environment.
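Combining the first pattern with ref() — a sketch (the model name, 3-day window, and Snowflake-style DATEADD are assumptions):

```sql
SELECT *
FROM {{ ref('stg_events') }}
{% if target.name == 'dev' %}
-- keep dev builds cheap: sample only the last 3 days
WHERE occurred_at >= DATEADD('day', -3, CURRENT_DATE)
{% endif %}
```

In prod the WHERE clause compiles away entirely; the model's logic stays identical across environments.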
Incremental models — the most-asked-about
When a fact table has billions of rows and you can't afford to rebuild it nightly, you go incremental.
{{ config(
materialized='incremental',
unique_key='event_id',
incremental_strategy='merge',
on_schema_change='append_new_columns'
) }}
SELECT
event_id,
user_id,
event_type,
occurred_at,
payload
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
-- only process new events on subsequent runs
WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}
Incremental strategies
| Strategy | What it does | When |
|---|---|---|
| `append` | Just INSERT new rows | Immutable event logs |
| `merge` | UPSERT by unique_key (warehouse MERGE INTO) | Most common — when rows can update |
| `delete+insert` | Delete matching rows then insert | When MERGE isn't supported or you want partition-replace semantics |
| `insert_overwrite` | Replace whole partitions (BigQuery, Spark) | Partition-based reprocessing |
Gotchas:
1. Late-arriving data falls off the window — if your filter is WHERE occurred_at > MAX(occurred_at), an event that arrives 2 hours late but happened 4 hours ago will be missed. Use a wider lookback (> MAX(occurred_at) - INTERVAL '24 hours') combined with merge so reprocessed rows deduplicate.
2. Schema drift breaks the model — when upstream adds a column, the existing table has the old shape. Use on_schema_change='append_new_columns' or 'sync_all_columns'.
3. Full refresh forgotten — when business logic changes, you need dbt run --full-refresh -s fct_events. Document this.
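Applying gotcha #1 to the model above, the lookback variant of the incremental filter looks like this (the 24-hour window is a tunable assumption, and the merge on unique_key='event_id' handles the resulting duplicates):

```sql
{% if is_incremental() %}
-- Reprocess a 24h lookback window; rows already loaded are
-- deduplicated by the merge on the unique_key.
WHERE occurred_at > (
    SELECT MAX(occurred_at) - INTERVAL '24 hours' FROM {{ this }}
)
{% endif %}
```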
Packages — don't reinvent
dbt has a Hub. Common packages:
- `dbt-utils` — generic tests and macros (`generate_surrogate_key`, `pivot`, `star`, `group_by`).
- `dbt-expectations` — Great Expectations-style tests ported to dbt.
- `codegen` — generates yaml from your warehouse schema.
- `elementary` — data observability layer over dbt artifacts.
- `audit_helper` — compare two model outputs (for refactors / migrations).
Exposures & the semantic layer
Exposures declare the downstream consumers of your models — dashboards, ML models, applications. They appear in the lineage graph.
version: 2
exposures:
- name: gpu_utilization_dashboard
type: dashboard
maturity: high
owner:
name: Data Team
email: data@example.com
depends_on:
- ref('fct_gpu_hours')
- ref('dim_providers')
Semantic Layer (dbt's metrics framework, also called MetricFlow): centralizes metric definitions. Instead of every dashboard re-defining "MRR," you define it once in YAML, and BI tools query it via a unified API.
Have an opinion on the metrics layer. "I'd reach for dbt Semantic Layer when there's any real risk of metric drift across teams — finance, product, and ops disagreeing on what 'active customer' means is the #1 problem the semantic layer solves."
CLI essentials
# Build everything (run + test)
dbt build
# Just compile templates — no warehouse calls
dbt compile
# Run a specific model and its descendants
dbt run -s fct_orders+
# Run a model and its ancestors
dbt run -s +fct_orders
# Run a folder, models tagged "daily", or by source
dbt run -s marts.finance
dbt run -s tag:daily
dbt run -s source:raw_app+
# Run only models changed in this PR (state-based)
dbt run -s state:modified+ --state ./prod-manifest
# Run tests
dbt test
dbt test -s fct_orders
# Full refresh an incremental model
dbt run --full-refresh -s fct_events
# Build docs site
dbt docs generate
dbt docs serve
The + graph syntax is essential. +model+ = ancestors and descendants. Know it cold.
CI / CD with dbt
Modern dbt teams treat models like code:
- PR opens → CI runs `dbt build -s state:modified+` against a staging schema (Slim CI).
- Tests must pass before merge.
- Slim CI uses the production manifest as state — only rebuilds what changed.
- Production scheduler (dbt Cloud, Airflow, Dagster) runs `dbt build` on a cron and/or after upstream loads complete.
- Audit packages (`audit_helper`) used during PR review to diff before/after model output.
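A Slim CI job can be sketched as a CI workflow like the following — everything here is an assumption (the adapter, the manifest-fetch script, the profiles path), shown only to make the shape concrete:

```yaml
# .github/workflows/dbt_ci.yml — illustrative Slim CI job, not a drop-in config.
name: dbt-ci
on: pull_request
jobs:
  slim_ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake        # adapter is an assumption
      - name: Fetch production manifest       # e.g. from object storage or the last prod run
        run: ./scripts/fetch_prod_manifest.sh ./prod-manifest
      - name: Build only what changed, plus downstream
        run: dbt build -s state:modified+ --defer --state ./prod-manifest
        env:
          DBT_PROFILES_DIR: ./ci
```

`--defer` lets unchanged upstream models resolve against production relations instead of rebuilding them in the CI schema.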
Performance & cost
- Materialize wisely — view for staging, table for marts, incremental for big facts.
- Cluster / partition — Snowflake: `cluster_by` in config; BigQuery: `partition_by` + `cluster_by`; Databricks: `partitioned_by` + Z-order.
- Limit ephemeral chains — many ephemerals in a chain produce monstrous compiled SQL.
- Tag and schedule — not every model needs to run every hour. Tag by SLA, schedule accordingly.
- Look at `dbt_artifacts` / Elementary — track run times, see which models are getting slow.
Interview talking points
"Staging, intermediate, marts. Staging models are 1:1 with sources — rename columns, cast types, no business logic. Intermediate models are reusable building blocks for complex joins. Marts are the analytics-facing layer organized by business domain — fact tables for events, dimension tables for entities. The convention's not magic but it makes onboarding easier and PR review faster. I'd default staging to views, marts to tables, and reach for incremental only when scale demands it."
"Default to view for staging — cheap, always fresh, no real cost. For marts, table is the default. I reach for incremental when full rebuild is too slow or too expensive — usually event tables or transactional logs at the billions-of-rows scale. For incremental, I think about: what's the unique key for merge, what's the lookback window for late-arriving data, and what happens on schema change. ephemeral for small reusable transformations I don't want to materialize."
"Three layers. One, dbt tests on every model — unique, not_null, relationships, accepted_values at minimum, plus custom singular tests for business rules. Two, source freshness checks that alert when upstream stops loading. Three, a data observability layer — Elementary or Monte Carlo — for things tests can't catch: anomalies, distribution shifts, volume drops. I severity-warn the tests that are signal-not-blocker and severity-error the ones that should fail the build."