Orchestration
Airflow, Dagster, Prefect, dbt Cloud — DAGs, dependencies, retries, idempotency, and what to choose when.
What an orchestrator does
Five things:
- Schedule — run things on a cron, or based on triggers.
- Sequence — run things in the right order based on dependencies.
- Retry — handle transient failures.
- Observe — show success/failure, runtimes, logs.
- Backfill — re-run historical windows.
For analytics engineering specifically: the orchestrator runs ingestion + dbt + downstream pushes on a schedule, with the right dependencies and retry behavior.
DAGs (Directed Acyclic Graphs)
Tasks with dependencies. ingest_orders → run_dbt_staging → run_dbt_marts → refresh_dashboards.
"Acyclic" matters — no cycles, so the graph has a clear topological order. Every orchestrator at the bottom uses some flavor of DAG.
Airflow
The default. Python DAGs defined as code. Mature, huge ecosystem.
from airflow import DAG
from airflow.decorators import task
from datetime import datetime, timedelta

with DAG(
    'daily_analytics',
    start_date=datetime(2024, 1, 1),
    schedule='@daily',
    catchup=False,
    default_args={'retries': 2, 'retry_delay': timedelta(minutes=5)},
) as dag:

    @task
    def ingest_orders():
        # fetch from source, land in warehouse
        ...

    @task
    def run_dbt():
        # invoke dbt build
        ...

    @task
    def refresh_dashboards():
        # trigger BI refresh
        ...

    ingest_orders() >> run_dbt() >> refresh_dashboards()
Strengths: mature, ubiquitous, huge operator library, well-documented.
Pain points: verbose for common data-team patterns, a weak data-asset model (it schedules tasks, not data), heavy operational overhead, scheduler bottlenecks at scale.
Managed options: AWS MWAA, Google Cloud Composer, Astronomer.
Dagster
Modern alternative. Asset-oriented: you declare what data should exist, not what tasks to run. Dagster figures out the run plan.
from dagster import asset

@asset
def orders():
    # this function produces the 'orders' asset
    return fetch_orders_from_api()

@asset(deps=[orders])
def dbt_marts():
    # depends on orders; runs after
    return invoke_dbt()
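A sketch of wiring those assets into a daily schedule (assuming orders and dbt_marts from above are in scope; the job name and cron are illustrative):

from dagster import Definitions, ScheduleDefinition, define_asset_job

# one job that materializes every asset in this code location
daily_job = define_asset_job('daily_analytics', selection='*')

defs = Definitions(
    assets=[orders, dbt_marts],
    schedules=[ScheduleDefinition(job=daily_job, cron_schedule='0 2 * * *')],
)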
Strengths: better data-team ergonomics, native dbt integration, software-defined assets, strong dev mode, better testing story.
Pain points: smaller ecosystem than Airflow, less standard in enterprise shops.
Managed: Dagster+ (Dagster Cloud).
Prefect
Python-native, lighter weight than Airflow. Hybrid execution model (compute runs anywhere; orchestration in Prefect Cloud).
from prefect import flow, task

@task(retries=3)
def ingest_orders():
    # fetch from source, land in warehouse
    ...

@task
def run_dbt():
    # invoke dbt build
    ...

@flow
def daily_analytics():
    # direct task calls run sequentially: run_dbt starts after ingest_orders finishes
    ingest_orders()
    run_dbt()

if __name__ == '__main__':
    # serve() keeps the process alive, scheduling and executing runs on the cron
    daily_analytics.serve(name='daily', cron='0 2 * * *')
Strengths: simplest dev experience, no separate scheduler infra needed, decorators feel natural.
Pain points: smaller ecosystem.
dbt Cloud scheduler
If the only thing you orchestrate is dbt, dbt Cloud's scheduler may be enough. Pros: trivial setup, native dbt integration. Cons: limited — can't easily orchestrate non-dbt steps. Once you need ingestion + dbt + downstream in one flow, you need a real orchestrator.
Idempotency & retries
Every task should be safe to retry. If a task fails halfway through, the orchestrator's retry should produce the same final state as a clean first run.
Patterns (the first and third are sketched after this list):
- Per-partition tasks — each run produces output for a specific date partition; re-running replaces that partition.
- MERGE / upsert — dbt incremental + merge strategy handles overlap.
- Transactional writes — use BEGIN/COMMIT, or use staging tables + atomic swap.
- Idempotency keys — for external API writes (sending emails, posting to webhooks), include a key the receiver dedupes against.
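A minimal sketch of the per-partition pattern plus a transactional write, assuming a DB-API-style connection with sqlite3 semantics (with conn: commits on success, rolls back on error) and hypothetical orders / orders_raw tables:

from datetime import date

def load_orders_partition(conn, ds: date):
    # safe to retry: a second run for the same ds deletes and
    # reinserts the same rows, converging on the same final state
    with conn:  # transaction: commit on success, rollback on error
        conn.execute("DELETE FROM orders WHERE order_date = ?", (ds,))
        conn.execute(
            "INSERT INTO orders SELECT * FROM orders_raw WHERE order_date = ?",
            (ds,),
        )

Because the delete and insert share one transaction, a mid-run failure leaves the partition untouched rather than half-written.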
Retries with exponential backoff (a minimal sketch follows this list):
- API calls / network: yes.
- Warehouse query timeouts: yes, but with budget.
- Bad logic / wrong SQL: retry doesn't help. Fail fast.
- Schema errors: retry won't fix. Fail fast and alert.
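Orchestrators give you this via config (the retries / retry_delay arguments above); a hand-rolled sketch to show the shape, with RETRYABLE and with_backoff as illustrative names:

import random
import time

RETRYABLE = (ConnectionError, TimeoutError)  # transient by nature

def with_backoff(fn, max_attempts=4, base_delay=1.0):
    # exponential backoff with jitter for transient failures;
    # anything outside RETRYABLE (logic bugs, schema errors) fails fast
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RETRYABLE:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())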
Dependencies — task vs data assets
Two ways to model dependencies:
- Task-based (Airflow classic) — task B runs after task A. Coupling is procedural.
- Asset-based (Dagster, modern Airflow Datasets) — task B depends on the existence/freshness of data asset X, which task A produces. Coupling is by data.
Asset-based is the better fit for analytics engineering — it matches how analysts think (about data tables, not Python functions). Modern Airflow (2.4+) bridges the gap with Datasets, sketched below.
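A sketch of that bridge using Airflow Datasets (2.4+; renamed Assets in Airflow 3). The URI is illustrative; the consumer DAG has no cron of its own and runs whenever the orders dataset is updated:

from datetime import datetime
from airflow import Dataset
from airflow.decorators import dag, task

orders = Dataset('warehouse://analytics/orders')

@dag(start_date=datetime(2024, 1, 1), schedule='@daily', catchup=False)
def producer():
    @task(outlets=[orders])
    def ingest_orders():
        ...  # task success marks the dataset as updated

    ingest_orders()

@dag(start_date=datetime(2024, 1, 1), schedule=[orders], catchup=False)
def consumer():
    @task
    def run_dbt():
        ...

    run_dbt()

producer()
consumer()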
Backfills
Most orchestrators support "re-run for a specific logical date": Airflow via airflow dags backfill, Dagster via partition backfills, Prefect via flow runs with custom parameters.
The principle (see 07-data-pipelines): backfill should use the production code with different parameters. Special-cased backfill scripts diverge from prod.
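In miniature, reusing the hypothetical load_orders_partition and conn from the idempotency sketch above: one parameterized entry point serves the daily run and the backfill alike:

from datetime import date, timedelta

def run_for(conn, ds: date):
    # the single production entry point, parameterized by logical date
    load_orders_partition(conn, ds)

# daily schedule: run_for(conn, date.today())
# backfill: the same code, looped over a historical window
start, end = date(2024, 1, 1), date(2024, 1, 31)
for n in range((end - start).days + 1):
    run_for(conn, start + timedelta(days=n))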
Interview talking points
"Depends on the team's center of gravity. Airflow if the team has existing Airflow operations or needs maximum ecosystem coverage — every SaaS has an Airflow operator. Dagster if the team is data-team-first and values dev ergonomics — asset-oriented modeling matches how analytics engineers actually think. Prefect for smaller teams that want the simplest path. For a brand new team I'd lean Dagster for analytics work; Airflow remains the safe enterprise choice."
"Three layers. Retries with exponential backoff for transient failures — network, rate limits, query timeouts. Idempotent task design so retries are safe. Clear failure modes for non-retryable errors — schema drift, bad data, auth failures. Alerts go to someone who actually owns the asset. And every failure has a runbook — 'X broke, here's what to check first.' Trying to retry your way out of a logic bug just delays the conversation."