Section D · Operations

Data Observability

Lineage, freshness, volume, schema, distribution. The five things to monitor for any production data system — plus the tools and the discipline around alerts.

The five pillars (Monte Carlo's framing, now standard)

  1. Freshness — when did data last update?
  2. Volume — how many rows arrived?
  3. Schema — what columns and types?
  4. Distribution — what do values look like?
  5. Lineage — where does this come from, what depends on it?

Mention these by name. They're industry-standard vocabulary.

Lineage

Knowing how a column propagates from source through transformations to the dashboard. Critical for:

  • Impact analysis — "if we change this raw column, what breaks?"
  • Root-cause analysis — "this dashboard is wrong, where did the bad value enter?"
  • Compliance — "where does this PII end up?"

Tools: dbt docs (lineage within dbt), OpenLineage (cross-tool standard), DataHub, Atlan, Alation (commercial catalogs with lineage).
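
dbt builds lineage automatically from ref() and source() calls; exposures extend that graph one hop further, to the dashboard, so impact analysis covers the BI layer too. A minimal sketch, with placeholder dashboard, model, and owner names:

```yaml
# models/marts/exposures.yml — names and URL are illustrative
exposures:
  - name: revenue_dashboard
    type: dashboard
    url: https://bi.example.com/dashboards/revenue   # where the downstream asset lives
    owner:
      name: Analytics team
      email: analytics@example.com
    depends_on:
      - ref('fct_orders')       # "if fct_orders changes, this dashboard is affected"
      - ref('dim_customers')
```

After `dbt docs generate`, the exposure shows up as a terminal node in the lineage graph, which answers the "what breaks downstream" question for that dashboard.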

Freshness

Stale data lies silently. Monitor at every stage:

  • Source freshness — when did the upstream system last load? dbt source freshness + alerts.
  • Model freshness — when did the dbt model last run? dbt-utils recency tests.
  • Dashboard freshness — when did the BI tool last refresh? Most BI tools surface this.

Set thresholds based on the SLA. A daily-batch model should warn if it hasn't refreshed within 25 hours and error after 30. A near-real-time model should warn within minutes.
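
In dbt, the source side of that SLA is declarative. A sketch matching the daily-batch thresholds above, with placeholder source, schema, and column names:

```yaml
# models/staging/src_app_db.yml — source names are placeholders
sources:
  - name: app_db
    schema: raw_app
    loaded_at_field: _loaded_at               # timestamp column the check compares to now()
    freshness:
      warn_after: {count: 25, period: hour}   # daily batch: warn past 25 h
      error_after: {count: 30, period: hour}  # error past 30 h
    tables:
      - name: orders
```

`dbt source freshness` evaluates the checks; model-level recency is covered by the `dbt_utils.recency` test mentioned above.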

Volume

Row counts in expected range. Common patterns:

  • Absolute — table has at least X rows.
  • Relative — today's rows are within N% of trailing average.
  • Statistical — N standard deviations from baseline.

Spikes and drops both matter. A spike could be duplication; a drop could be a broken ingestion. Both deserve alerts.
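
The absolute pattern is a one-line test; the relative and statistical patterns are what Elementary's anomaly tests (and the commercial tools) automate. A hedged sketch; the model name, floor, and timestamp column are illustrative:

```yaml
models:
  - name: fct_orders
    tests:
      # absolute: fail if the table ever drops below a hard floor
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 100000
      # statistical: Elementary baselines daily row counts and flags deviations
      # (requires the elementary dbt package)
      - elementary.volume_anomalies:
          timestamp_column: created_at
          time_bucket:
            period: day
            count: 1
```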

Schema

Column additions are usually safe. Removals, renames, and type changes are usually bugs. Monitor:

  • Expected columns exist.
  • Types haven't changed.
  • Unexpected new columns are flagged for review (in case they contain PII).

dbt model contracts make this explicit at the model level. Schema registries do it at the event-stream level.
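
A dbt model contract turns "expected columns and types" into a build-time check rather than a monitoring-time one: if the model's SQL stops matching the declared shape, the build fails before bad data lands. A minimal sketch with placeholder columns:

```yaml
models:
  - name: dim_customers          # placeholder model
    config:
      contract:
        enforced: true           # dbt fails the build if the SQL doesn't match the spec below
    columns:
      - name: customer_id
        data_type: bigint
        constraints:
          - type: not_null
      - name: email
        data_type: varchar       # an upstream type change now breaks loudly instead of silently
```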

Distribution

The "values look right" pillar. Things to watch:

  • Null proportion — sudden jump in nulls = upstream broke.
  • Value distribution — mean, p50, p95 of numeric columns. Sudden shifts = anomaly.
  • Cardinality — distinct count of categorical columns. New / disappeared categories.
  • Outlier rate — rows in tails of distribution.

This is where ML-driven tools (Monte Carlo, Anomalo) shine — they baseline distributions automatically and flag deviations.
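
Before (or alongside) an ML tool, the declarative versions of these checks come from dbt-utils and dbt-expectations. A sketch; thresholds, column names, and the accepted value set are placeholders:

```yaml
models:
  - name: fct_orders
    columns:
      - name: amount
        tests:
          # null proportion: at least 99% of rows populated
          - dbt_utils.not_null_proportion:
              at_least: 0.99
          # value distribution: mean stays inside an expected band
          - dbt_expectations.expect_column_mean_to_be_between:
              min_value: 20
              max_value: 200
      - name: status
        tests:
          # cardinality: no new or disappeared categories
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```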

Tools landscape

Tool                         | Class                 | Best for
-----------------------------|-----------------------|---------------------------------------------------------------
dbt tests + dbt-utils        | OSS, in-warehouse     | Baseline — every project should have these
dbt-expectations             | OSS, in-warehouse     | Distribution and statistical tests
Elementary                   | OSS, dbt-native       | Anomaly detection + observability dashboard over dbt artifacts
Great Expectations           | OSS, Python-first     | Standalone validation pipelines, non-dbt environments
Monte Carlo, Bigeye, Anomalo | Commercial, ML-driven | Org-wide observability with less config
DataHub, Atlan, Alation      | Commercial catalogs   | Lineage + catalog + governance at scale
OpenLineage                  | OSS standard          | Cross-tool lineage protocol

Alerts that don't get ignored

The fastest way to break a data team's trust in their own monitoring: pages that fire too often. Discipline:

  • Severity tiers — error (page), warn (Slack channel), info (dashboard only).
  • Owners on every alert — assigned to a model and a person/team.
  • Runbooks — when X alerts, do Y to investigate. Documented in the model's YAML or in a wiki.
  • Quiet hours — non-critical alerts pause overnight; only true outages page at 3 AM.
  • Alert review — weekly check: which alerts fired, which were noise, tune accordingly.
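
In dbt, the severity tier lives on the test itself; owner and routing metadata live in `meta`, which is what a tool like Elementary reads to decide where an alert goes. Keys and values here are illustrative conventions:

```yaml
models:
  - name: fct_orders
    meta:
      owner: "@payments-data"          # illustrative: who gets paged / tagged
      channel: "#data-alerts-payments" # illustrative: Slack channel for warn-level noise
    columns:
      - name: order_id
        tests:
          - not_null:
              config:
                severity: error        # true contract violation: page
          - unique:
              config:
                severity: warn         # worth a look, not a 3 AM page
```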

The alert-fatigue spiral

Every "let me add a test for this" without thinking about severity leads to a future where everything fires constantly and nothing gets investigated. Tests are free to write; alerts are expensive to act on. Be conservative with what pages.

Talking points

"How do you observe a production data pipeline?"

"Five pillars — freshness, volume, schema, distribution, lineage. Freshness via source freshness checks and recency tests; volume via row-count anomaly detection; schema via tests on expected columns and dbt contracts; distribution via dbt-expectations or a commercial tool like Monte Carlo; lineage via dbt docs at minimum, OpenLineage or DataHub at scale. Severity discipline matters — error level for true contracts, warn for signal. Every alert needs an owner and a runbook, or it becomes noise."

"How do you debug when a dashboard's number is wrong?"

"Walk the lineage backwards. Start at the dashboard — what model is it pulling from. Then the model — what does its SQL do, what's the grain, do the tests pass. Then sources — is upstream loading correctly. Common bugs at each layer: fanout from a wrong-grain join, NULL filters dropping rows silently, time-zone confusion, deduplication that didn't deduplicate, a stale incremental that lost late-arriving data. Once root-caused: fix the data, add a test that would have caught it, document the gotcha. Optional final step: postmortem if it's painful enough."