Data Pipelines
ETL vs ELT, batch vs streaming, idempotency, backfills, late-arriving data, schema evolution.
ETL vs ELT
- ETL (Extract, Transform, Load): transform before loading. The older pattern, from when warehouse storage was the expensive resource.
- ELT (Extract, Load, Transform): land raw data in the warehouse, transform inside it. The modern default.
Why ELT won:
- Cloud warehouses made compute cheap relative to storage.
- Having raw data lets you re-derive when business logic changes.
- dbt makes warehouse-side transformation ergonomic.
- Multiple consumers can derive different views from the same raw data.
ETL still makes sense for specific cases: very high volume where most data is uninteresting (filter early), or PII that must be stripped before landing.
Batch vs streaming
| Approach | Freshness | Complexity | Use when |
|---|---|---|---|
| Daily batch | 24h | Lowest | Most reporting; finance closes |
| Hourly batch | 1h | Low | Operational dashboards |
| Micro-batch (5min) | 5-10min | Medium | "Near real-time" dashboards, ops alerting |
| True streaming | seconds | High | Real-time fraud detection, live monitoring, customer-facing live data |
Most "we need real-time" actually means "we need fresher." Ask: "What's the actual latency requirement?" Often the answer is "users are fine with 5-15 minutes," which makes micro-batch the right pick. True streaming is hard — exactly-once is painful, debugging is harder, observability is harder. Don't pay the streaming cost without a specific latency justification.
Delivery semantics
| Guarantee | What it means | Use |
|---|---|---|
| At-most-once | Events may be dropped, never duplicated | Telemetry where loss is OK |
| At-least-once | Events are never dropped, but may be duplicated | Most common — pair with an idempotent consumer |
| Exactly-once | Every event delivered exactly once | Hard. Usually approximated with at-least-once + idempotency. |
Idempotency — the most-asked-about pattern
A pipeline is idempotent when running it twice for the same window produces the same result. Non-negotiable for production data pipelines.
Pattern: idempotency keys
Every event carries a unique ID. Consumers track seen IDs and skip duplicates.
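A minimal sketch of consumer-side dedup in the warehouse, assuming illustrative names: raw_events is the at-least-once landing table, events_clean the deduplicated target.

```sql
-- Keep one row per event_id within the batch, then skip IDs that
-- earlier runs already loaded. Re-running the insert adds nothing.
INSERT INTO events_clean (event_id, payload, occurred_at)
SELECT event_id, payload, occurred_at
FROM (
    SELECT event_id, payload, occurred_at,
           ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY occurred_at) AS rn
    FROM raw_events
) r
WHERE rn = 1
  AND NOT EXISTS (
      SELECT 1 FROM events_clean c WHERE c.event_id = r.event_id
  );
```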
Pattern: window-replaceable
A pipeline run produces output for a specific window (e.g., a date partition). Re-running replaces the output for that window rather than appending.
```sql
-- Delete-then-insert for one partition. The transaction keeps a
-- failed re-run from leaving the partition half-written.
BEGIN TRANSACTION;
DELETE FROM fct_orders WHERE date_key = @run_date;
INSERT INTO fct_orders
SELECT ... FROM source WHERE date(occurred_at) = @run_date;
COMMIT;
```
Run a thousand times for the same @run_date → same output. Idempotent.
Pattern: MERGE / upsert
dbt's incremental "merge" strategy. The unique_key determines which rows are updated vs inserted. Re-running on overlapping data updates rather than duplicating.
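A minimal dbt sketch of the merge strategy; model and column names are illustrative:

```sql
-- Hypothetical incremental model keyed on order_id. On re-runs,
-- rows matching an existing order_id are updated, new ones inserted.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'
) }}

SELECT order_id, customer_id, amount, occurred_at
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }})
{% endif %}
```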
Backfills
A backfill is "re-run the pipeline for historical windows." It happens when:
- Business logic changed and you need to apply it to history.
- The pipeline had a bug and produced wrong output.
- You're adding a new metric and need to fill in history.
- You're bootstrapping a new model.
The principle
A backfill should run the production code, not a special-cased script. Separate backfill scripts diverge from prod and break.
For dbt: parameterize the model with a date variable, then run once per window with the variable set:
```bash
# Backfill the last 31 days: one production run per date partition.
for d in $(seq 0 30); do
  date_to_backfill=$(date -d "$((30 - d)) days ago" +%Y-%m-%d)
  dbt run --select fct_orders --vars "{run_date: $date_to_backfill}"
done
```
For Airflow: partitioned DAG runs. Each run is for a specific logical date; backfilling = scheduling the missing logical dates.
Late-arriving data — the silent killer
Data that should have arrived in window N actually arrives in window N+1 (or N+5). If your incremental filter is "rows newer than the latest timestamp already loaded," late data falls permanently outside the window and is never processed.
Pattern: lookback window
Always look back further than "since last run."
```sql
{% if is_incremental() %}
-- 24-hour lookback, merged on unique_key
WHERE occurred_at > (SELECT MAX(occurred_at) FROM {{ this }}) - INTERVAL '24 hours'
{% endif %}
```
Combined with incremental_strategy='merge' and a unique_key, late events upsert correctly within the lookback window.
Pattern: watermarking (streaming)
Streaming systems use watermarks — "we're now processing data up to this event time; anything later is late." Late events go to a side channel for special handling.
Pattern: reprocessing window
For very late data (days or weeks late), schedule a periodic full-rerun of the prior week. Catches stragglers, costs more, but reliable.
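A sketch of the sweep, reusing the window-replaceable pattern over the prior seven daily partitions (generic SQL; names are illustrative, and the date arithmetic will vary by warehouse):

```sql
-- Weekly straggler sweep: fully rebuild the last 7 daily partitions
-- with the same production transform. Safe to repeat, because each
-- partition is replaced rather than appended to.
BEGIN TRANSACTION;
DELETE FROM fct_orders
WHERE date_key BETWEEN @run_date - INTERVAL '7 days' AND @run_date;
INSERT INTO fct_orders (date_key, order_id, amount)
SELECT date(occurred_at), order_id, amount
FROM source
WHERE date(occurred_at) BETWEEN @run_date - INTERVAL '7 days' AND @run_date;
COMMIT;
```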
Schema evolution
Upstream schemas change. Strategies for handling it:
Detect
- Source-freshness alerts on missing columns.
- Schema tests that fail when expected columns are missing (see the sketch after this list).
- Schema registry (Confluent, Glue) for streaming events.
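One way to get the loud failure is a dbt singular test, i.e. a query that fails by returning rows. A sketch, assuming the warehouse exposes information_schema.columns and using illustrative table and column names:

```sql
-- tests/assert_orders_columns_present.sql (hypothetical)
-- Returns a row for every expected column the source table no
-- longer has; any returned row fails the test run.
WITH expected AS (
    SELECT 'order_id' AS col
    UNION ALL SELECT 'customer_id'
    UNION ALL SELECT 'occurred_at'
)
SELECT col
FROM expected
WHERE col NOT IN (
    SELECT LOWER(column_name)
    FROM information_schema.columns
    WHERE LOWER(table_name) = 'orders'
);
```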
Adapt
- dbt:
  - on_schema_change='append_new_columns' — auto-add new columns.
  - on_schema_change='sync_all_columns' — also drop columns that disappear. Use with caution.
- Explicit staging contract — staging models list the columns they depend on; missing columns fail loudly (sketch below).
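A minimal staging-contract sketch; source and column names are illustrative:

```sql
-- stg_orders: the explicit column list is the contract. If the
-- source renames or drops order_id, this model errors at run time
-- instead of silently passing the change downstream.
SELECT
    order_id,
    customer_id,
    amount,
    occurred_at
FROM {{ source('shop', 'orders') }}
```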
Coordinate
For breaking changes — column renames, type changes, semantic shifts — there's no auto-magic. This is a coordination conversation. The pattern: data contracts. Producers commit to "this schema for this version," consumers commit to "we only rely on documented columns." Schema changes are versioned and announced.
CDC (Change Data Capture)
Replicate database changes in near real time without full table dumps. Tools:
- Debezium — open-source Kafka-based CDC.
- Fivetran HVR, Airbyte, Stitch — managed CDC for major databases.
- Snowflake Streams — track changes within Snowflake.
- BigQuery Datastream — managed CDC into BigQuery.
CDC produces a stream of (op_type, before, after, ts) per row change. The warehouse-side challenge: collapse the stream to the current state of each row, while preserving history for SCD-style queries.
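A sketch of the collapse step, assuming the stream is landed in a table cdc_orders with the after-image columns flattened (all names illustrative):

```sql
-- Collapse a CDC stream to current row state: the latest change per
-- primary key wins, and keys whose last op was a delete drop out.
WITH ranked AS (
    SELECT
        order_id, customer_id, amount, occurred_at, op_type,
        ROW_NUMBER() OVER (
            PARTITION BY order_id ORDER BY ts DESC
        ) AS rn
    FROM cdc_orders
)
SELECT order_id, customer_id, amount, occurred_at
FROM ranked
WHERE rn = 1
  AND op_type <> 'delete';
```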
Talking points
"Four layers. Ingestion: source → land raw in the warehouse with retention. Staging: 1:1 with source, light cleanup, no business logic. Transformation: dbt models with tests and lineage. Serving: marts that BI consumes. Orchestration with Airflow or Dagster running on a schedule. For incrementality I'd ask about volume — if it's manageable for full rebuild, table materialization is simpler; if not, incremental with a lookback window and merge strategy. Failure modes I'd plan for: source schema drift (tests fail loud), late-arriving data (lookback window), pipeline failures (orchestrator retries + idempotent steps), backfills (parameterized models)."