Data Observability
Lineage, freshness, volume, schema, distribution. The five things to monitor for any production data system — plus the tools and the discipline around alerts.
The five pillars (Monte Carlo's framing, now standard)
- Freshness — when did data last update?
- Volume — how many rows arrived?
- Schema — what columns and types?
- Distribution — what do values look like?
- Lineage — where does this come from, what depends on it?
Mention these by name. They're industry standard vocabulary.
Lineage
Knowing how a column propagated from source through transformations to the dashboard. Critical for:
- Impact analysis — "if we change this raw column, what breaks?"
- Root-cause analysis — "this dashboard is wrong, where did the bad value enter?"
- Compliance — "where does this PII end up?"
Tools: dbt docs (lineage within dbt), OpenLineage (cross-tool standard), DataHub, Atlan, Alation (commercial catalogs with lineage).
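dbt's lineage graph stops at the warehouse unless you declare what consumes your models. A minimal sketch of a dbt exposure that makes a dashboard a node in the graph; the dashboard, model names, owner, and URL below are hypothetical:

```yaml
# models/exposures.yml (illustrative names and URL)
version: 2

exposures:
  - name: revenue_dashboard
    type: dashboard
    maturity: high
    url: https://bi.example.com/dashboards/revenue
    description: Executive revenue dashboard fed by the marts below.
    owner:
      name: Analytics Team
      email: analytics@example.com
    depends_on:
      - ref('fct_orders')
      - ref('dim_customers')
```

With this in place, `dbt docs generate` renders the dashboard as a terminal node, so impact analysis ("what breaks downstream?") extends past the last model.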
Freshness
Stale data lies silently. Monitor at every stage:
- Source freshness — when did the upstream system last load? dbt source freshness + alerts.
- Model freshness — when did the dbt model last run? dbt-utils recency tests.
- Dashboard freshness — when did the BI tool last refresh? Most BI tools surface this.
Set thresholds based on SLA. A daily-batch model should warn if not refreshed in 25 hours, error after 30. A near-real-time model: warn in minutes.
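A sketch of both checks in dbt, assuming a source called `app_db` and a model called `fct_orders` with a `loaded_at` timestamp; names and thresholds are illustrative:

```yaml
version: 2

sources:
  - name: app_db
    loaded_at_field: _loaded_at          # column the loader stamps on arrival
    freshness:
      warn_after: {count: 25, period: hour}
      error_after: {count: 30, period: hour}
    tables:
      - name: orders

models:
  - name: fct_orders
    tests:
      # dbt_utils.recency: fail if no row is newer than 25 hours
      - dbt_utils.recency:
          datepart: hour
          field: loaded_at
          interval: 25
```

`dbt source freshness` runs the source check and is the natural thing to schedule right before the daily build; the recency test runs with the rest of `dbt test`.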
Volume
Row counts in expected range. Common patterns:
- Absolute — table has at least X rows.
- Relative — today's rows are within N% of trailing average.
- Statistical — N standard deviations from baseline.
Spikes and drops both matter. A spike could be duplicate loads; a drop could be a broken ingestion job. Both deserve alerts.
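A sketch of the absolute and relative patterns with `dbt_expectations.expect_table_row_count_to_be_between`; the model name, date filter, and thresholds are invented, and the date arithmetic will need adjusting to your warehouse. The statistical pattern is what Elementary or a commercial tool layers on top by learning a baseline instead of you hard-coding one.

```yaml
version: 2

models:
  - name: fct_orders
    tests:
      # Absolute: the full table should never be implausibly small
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 100000
      # Relative: yesterday's partition should land in a sane band
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 8000
          max_value: 15000
          row_condition: "order_date = current_date - 1"
```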
Schema
Column additions are usually safe. Removals, renames, and type changes are usually bugs. Monitor:
- Expected columns exist.
- Types haven't changed.
- Unexpected new columns flagged for review (in case they're PII).
dbt model contracts make this explicit at the model level. Schema registries do it at the event-stream level.
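A sketch of a dbt model contract (dbt 1.5+): the build fails if the model's SQL stops producing these columns with these types. Model and column names are hypothetical, and the data types need to be valid for your warehouse.

```yaml
version: 2

models:
  - name: fct_orders
    config:
      contract:
        enforced: true       # dbt compares this declared shape against the compiled SQL
    columns:
      - name: order_id
        data_type: bigint
        constraints:
          - type: not_null
      - name: order_total
        data_type: numeric
      - name: order_date
        data_type: date
```

On models without contracts, dbt-expectations has column-matching tests that cover similar ground.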
Distribution
The "values look right" pillar. Things to watch:
- Null proportion — sudden jump in nulls = upstream broke.
- Value distribution — mean, p50, p95 of numeric columns. Sudden shifts = anomaly.
- Cardinality — distinct count of categorical columns. New / disappeared categories.
- Outlier rate — rows in tails of distribution.
This is where ML-driven tools (Monte Carlo, Anomalo) shine — they baseline distributions automatically and flag deviations.
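Cheap in-warehouse versions of these checks are still worth having. A sketch using dbt-utils and dbt-expectations; column names and thresholds are invented, and an ML tool would learn the bands instead of you hard-coding them:

```yaml
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_total
        tests:
          # Null proportion: at least 99% of values populated
          - dbt_utils.not_null_proportion:
              at_least: 0.99
          # Central tendency: mean should stay inside a plausible band
          - dbt_expectations.expect_column_mean_to_be_between:
              min_value: 20
              max_value: 200
              config:
                severity: warn
      - name: order_status
        tests:
          # Cardinality: a new or vanished category fails this test
          - accepted_values:
              values: ['placed', 'shipped', 'delivered', 'returned']
```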
Tools landscape
| Tool | Class | Best for |
|---|---|---|
| dbt tests + dbt-utils | OSS, in-warehouse | Baseline — every project should have these |
| dbt-expectations | OSS, in-warehouse | Distribution and statistical tests |
| Elementary | OSS, dbt-native | Anomaly detection + observability dashboard over dbt artifacts |
| Great Expectations | OSS, Python-first | Standalone validation pipelines, non-dbt environments |
| Monte Carlo, Bigeye, Anomalo | Commercial, ML-driven | Org-wide observability with less config |
| DataHub, Atlan, Alation | Commercial catalogs | Lineage + catalog + governance at scale |
| OpenLineage | OSS standard | Cross-tool lineage protocol |
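The OSS baseline rows of the table are a one-file change. A sketch of `packages.yml`; the version ranges are illustrative, so pin to whatever is current on the dbt Package Hub and install with `dbt deps`:

```yaml
# packages.yml (version ranges are illustrative)
packages:
  - package: dbt-labs/dbt_utils
    version: [">=1.0.0", "<2.0.0"]
  - package: calogica/dbt_expectations
    version: [">=0.10.0", "<0.11.0"]
  - package: elementary-data/elementary
    version: [">=0.13.0", "<0.14.0"]
```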
Alerts that don't get ignored
The fastest way to break a data team's trust in its own monitoring: pages that fire too often. Discipline:
- Severity tiers — error (page), warn (Slack channel), info (dashboard only).
- Owners on every alert — each alert is tied to a model and to a named person or team.
- Runbooks — when alert X fires, do Y to investigate. Documented in the model's YAML or the team wiki.
- Quiet hours — non-critical alerts pause overnight; only true outages page at 3 AM.
- Alert review — weekly check: which alerts fired, which were noise, tune accordingly.
Every "let me add a test for this" without thinking about severity leads to a future where everything fires constantly and nothing gets investigated. Tests are free to write; alerts are expensive to act on. Be conservative with what pages.
Talking points
"Five pillars — freshness, volume, schema, distribution, lineage. Freshness via source freshness checks and recency tests; volume via row-count anomaly detection; schema via tests on expected columns and dbt contracts; distribution via dbt-expectations or a commercial tool like Monte Carlo; lineage via dbt docs at minimum, OpenLineage or DataHub at scale. Severity discipline matters — error level for true contracts, warn for signal. Every alert needs an owner and a runbook, or it becomes noise."
"Walk the lineage backwards. Start at the dashboard — what model is it pulling from. Then the model — what does its SQL do, what's the grain, do the tests pass. Then sources — is upstream loading correctly. Common bugs at each layer: fanout from a wrong-grain join, NULL filters dropping rows silently, time-zone confusion, deduplication that didn't deduplicate, a stale incremental that lost late-arriving data. Once root-caused: fix the data, add a test that would have caught it, document the gotcha. Optional final step: postmortem if it's painful enough."