Section B · Critical

Data Modeling

Kimball star schema, slowly-changing dimensions, OBT, grain. The conceptual heart of analytics engineering — get this right and most other questions get easier.

Grain — the first thing you ask, always

Before designing anything, before writing any SQL, before any modeling discussion: what is the grain of this table?

Grain = what does one row represent. "One row per order." "One row per user per day." "One row per GPU per minute." "One row per inference request."

If you can't state the grain of a table in one sentence, the model isn't right yet.

Senior tell

Saying "what's the grain of this table?" out loud in any modeling discussion. Interviewers physically lean in.

Atomic grain vs aggregated grain

Always model facts at the atomic grain (the lowest level available), then aggregate on top. "One row per order_item" is more atomic than "one row per order". You can always roll atomic data up to a coarser grain; you can never recover atomic detail from an aggregate.
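Rolling atomic rows up to a coarser grain is a one-line aggregation (table and column names here are illustrative):

```sql
-- Aggregate the atomic grain (one row per order_item)
-- up to order grain (one row per order).
SELECT
  order_id,
  COUNT(*)          AS line_item_count,
  SUM(quantity)     AS total_units,
  SUM(item_revenue) AS order_revenue
FROM fct_order_items
GROUP BY order_id
```

The reverse direction (splitting order totals back into line items) is impossible, which is why you store the atomic grain.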

Facts and dimensions — the Kimball vocabulary

Fact tables (fct_*)

  • Hold measures — things you sum/count/average (revenue, count, duration).
  • Tall and narrow — many rows, relatively few columns.
  • Time-anchored — almost always have an event timestamp.
  • Foreign keys to dimensions: user_id, product_id, date_key.
  • One row per atomic event — order_placed, inference_request, gpu_minute.

Dimension tables (dim_*)

  • Hold attributes / descriptions — name, status, country, tier.
  • Short and wide — fewer rows, many columns.
  • One row per entity — one row per user, one row per product, one row per date.
  • Updated when entities change (slowly — hence "slowly changing dimensions").

Three types of fact tables

Type                       | What it captures                                 | Example
───────────────────────────┼──────────────────────────────────────────────────┼───────────────────────────────────────
Transaction fact           | One row per atomic event                         | fct_orders, fct_inference_requests
Periodic snapshot fact     | One row per entity per period                    | fct_daily_account_balance, fct_gpu_hourly_utilization
Accumulating snapshot fact | One row per process, columns for each milestone  | fct_order_lifecycle (ordered, paid, shipped, delivered)

Measure types in facts

  • Additive — sum across any dimension makes sense. Revenue, units sold. Easiest to model.
  • Semi-additive — sums across some dimensions but not time. Account balance — summable across customers, not across time.
  • Non-additive — ratios, percentages. Conversion rate. Margin %. Store the numerator and denominator separately; compute the ratio at query time.
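The non-additive pattern as a sketch, with illustrative table and column names: keep the numerator and denominator as additive measures and derive the ratio at query time.

```sql
-- Averaging a pre-computed rate column weights every row equally (wrong).
-- Instead, sum the additive parts and divide at query time.
SELECT
  date_key,
  SUM(conversions) AS conversions,
  SUM(sessions)    AS sessions,
  SUM(conversions) * 1.0 / NULLIF(SUM(sessions), 0) AS conversion_rate
FROM fct_sessions
GROUP BY date_key
```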

Star schema — the default

A central fact table surrounded by dimension tables, each connected by a foreign key. Visually it looks like a star.

                ┌──────────┐
                │ dim_date │
                └────┬─────┘
                     │
┌──────────┐         │         ┌─────────────┐
│ dim_user ├─────────┼─────────┤ dim_product │
└──────────┘         │         └─────────────┘
                ┌────▼──────┐
                │ fct_orders│
                └────┬──────┘
                     │
                ┌────▼───────┐
                │ dim_region │
                └────────────┘

Why star schema wins

  • Query-friendly — joins are simple and consistent (one hop from fact to dim).
  • Performance — modern warehouses optimize star schema joins explicitly.
  • BI-tool-friendly — Looker, Tableau, Power BI all model around it.
  • Easy to extend — add a new fact or dim without breaking existing ones.
  • Self-documenting — the structure tells you what the business cares about.

Surrogate keys

Each dimension has a synthetic primary key (a hash or sequence) called a surrogate key. The fact references that, not the natural key.

Why surrogate keys:

  • SCD Type 2 versions need their own keys (a customer can have multiple rows in dim_customer across time).
  • Natural keys can change. Surrogate keys don't.
  • Joins are faster on small integer/hash keys than on long strings.

In dbt: {{ dbt_utils.generate_surrogate_key(['col1', 'col2']) }} generates a hash.
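A minimal dbt model sketch using that macro (the stg_customers source and column names are assumptions):

```sql
-- dim_customers.sql: one surrogate key per customer version
SELECT
    {{ dbt_utils.generate_surrogate_key(['customer_id', 'valid_from']) }} AS customer_key,
    customer_id,
    plan,
    valid_from,
    valid_to
FROM {{ ref('stg_customers') }}
```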

Snowflake schema (the technique, not the warehouse)

Same as star schema, but dimensions are normalized — a dimension can reference other dimension tables.

Example: dim_product → dim_category → dim_department. Three tables instead of one wide product dim.

Trade-off:

  • Pro: saves storage, reduces update anomalies.
  • Con: more joins, harder for BI users, slower queries.

Modern best practice: denormalize dimensions (single wide dim_product table) unless you have a specific reason to snowflake. Storage is cheap; query simplicity is valuable.
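The denormalization itself is just a build-time join; a sketch with illustrative staging table names:

```sql
-- Collapse the snowflake into one wide dim_product at build time,
-- so analysts only ever join a single dimension.
SELECT
  p.product_id,
  p.product_name,
  c.category_name,
  d.department_name
FROM stg_products p
LEFT JOIN stg_categories  c ON p.category_id   = c.category_id
LEFT JOIN stg_departments d ON c.department_id = d.department_id
```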

One Big Table (OBT) — the modern alternative

Instead of fact + dimensions joined at query time, materialize a single wide table that pre-joins everything. Each fact row already contains all dimension attributes.

Pros

  • No joins for analysts — point and click in BI.
  • Faster query latency on flat data.
  • Simpler mental model.

Cons

  • Storage cost — denormalized data balloons.
  • Maintenance — when a dim attribute changes, every fact row needs updating.
  • SCD handling is messier.
  • Less flexible — you've baked in specific joins.

Hybrid approach

Most mature teams use both. Star schema as the canonical model layer; OBT-style "wide" tables built on top for specific BI use cases or for ML feature serving. The wide tables are derived from the star; they're not the source of truth.
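A sketch of deriving a wide table from the star (key and column names are assumptions):

```sql
-- obt_orders: pre-joined consumption table; the star remains the source of truth.
SELECT
  f.order_id,
  f.ordered_at,
  f.revenue,
  u.plan,
  u.country,
  p.product_name,
  r.region_name
FROM fct_orders f
LEFT JOIN dim_users    u ON f.user_key    = u.user_key
LEFT JOIN dim_products p ON f.product_key = p.product_key
LEFT JOIN dim_regions  r ON f.region_key  = r.region_key
```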

Slowly Changing Dimensions — the time-travel problem

Dimensions change. A customer's plan changes from Basic to Pro. A product's price changes. A GPU provider's location changes. How do you model this?

The question is: when looking at a historical fact, do you want to see the dimension attributes as they were then, or as they are now?

SCD types — know all of them by name

Type   | What it does                                                    | Use when
───────┼─────────────────────────────────────────────────────────────────┼──────────────────────────────────────────────────────
Type 0 | Never changes. Original values preserved forever.               | Birth date, original signup date.
Type 1 | Overwrite. Current value only; history lost.                    | Don't care about history. Email updates, typos fixed.
Type 2 | Add a new row per change. Versioned with valid_from / valid_to. | Most common. When you need point-in-time accuracy.
Type 3 | Add a column for "previous value." Only keeps one version.      | Rare. When you specifically need "current and previous."
Type 4 | Maintain a separate history table. Main dim is current-only.    | Hybrid approach when history is queried rarely.
Type 6 | 1+2+3 combined. Type 2 versioning + a "current value" column.   | Point-in-time joins plus easy "current state" filtering.

SCD Type 2 — the shape

dim_users with SCD2
user_id  | user_key | plan      | valid_from  | valid_to    | is_current
─────────┼──────────┼───────────┼─────────────┼─────────────┼───────────
123      | abc      | basic     | 2024-01-01  | 2024-06-15  | false
123      | def      | pro       | 2024-06-15  | 2025-02-01  | false
123      | ghi      | enterprise| 2025-02-01  | 9999-12-31  | true
456      | jkl      | basic     | 2024-03-10  | 9999-12-31  | true

Joining a fact to an SCD Type 2 dim

This is the interview gotcha. Don't join on user_id alone — you'll fan out.

point-in-time join
SELECT
  o.order_id,
  o.ordered_at,
  o.revenue,
  u.plan AS plan_at_order_time
FROM fct_orders o
JOIN dim_users u
  ON o.user_id = u.user_id
  AND o.ordered_at >= u.valid_from
  AND o.ordered_at <  u.valid_to

This gives the plan the user was on when they placed the order. That's the whole point of SCD2.

For "what plan are they on now?" — filter on is_current and skip the time-range join.

Common bug: open-ended valid_to

Use 9999-12-31 or a similar sentinel for the current row's valid_to, never NULL: a NULL fails every range comparison, so time-range joins silently drop the current rows. This is in every "Kimball-style SCD" gotcha question.

The date dimension

Every warehouse should have a dim_date table. One row per date, with columns for year, quarter, month, week, day-of-week, fiscal-year, holiday flags, business-day flags.

Why bother

  • BI tools can pivot/filter on rich date attributes (fiscal quarter, week-over-week).
  • Avoids re-deriving date attributes in every query.
  • Holiday and business-day logic lives in one place.
  • Outer-joining dim_date to fct_daily_x guarantees a row for every day (handles gaps).

dim_date example columns
date_day, year, quarter, month, month_name, week, day_of_week, day_name,
fiscal_year, fiscal_quarter, is_weekend, is_business_day, is_holiday,
days_in_month, is_first_of_month, is_last_of_month, ...
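The gap-handling point in SQL, with an illustrative daily fact name:

```sql
-- Left-join from the date spine so days with zero activity still appear.
SELECT
  d.date_day,
  COALESCE(f.order_count, 0) AS order_count
FROM dim_date d
LEFT JOIN fct_daily_orders f ON f.date_day = d.date_day
WHERE d.date_day BETWEEN DATE '2025-01-01' AND DATE '2025-01-31'
```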

dbt's dbt_utils.date_spine macro is the standard way to generate it.
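A sketch of the spine model (the date bounds are arbitrary; enrichment columns get layered on top in a later CTE or model):

```sql
-- models/dim_date.sql: one row per day via dbt_utils.date_spine
{{ dbt_utils.date_spine(
    datepart="day",
    start_date="cast('2020-01-01' as date)",
    end_date="cast('2030-01-01' as date)"
) }}
```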

Periodic snapshot fact tables

Some questions don't have a natural transaction grain. "What's the GPU utilization right now?" — there's no single event. You take a snapshot at regular intervals.

Example: fct_gpu_hourly_utilization — one row per GPU per hour. Captures utilization_pct, memory_used_gb, requests_served, customer_id_currently_assigned, etc.

Different from a transaction fact (one row per event) — same shape, different semantics. You can sum measures across GPUs but be careful summing across hours (semi-additive).
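Handling the semi-additive measure in SQL, against the columns listed above:

```sql
-- utilization_pct is semi-additive: average it across hours,
-- but requests_served can be summed freely.
SELECT
  gpu_id,
  AVG(utilization_pct) AS avg_utilization_pct,
  SUM(requests_served) AS total_requests
FROM fct_gpu_hourly_utilization
WHERE hour_ts >= DATE '2025-02-01'
GROUP BY gpu_id
```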

Bridge tables — many-to-many

When a fact has a many-to-many relationship to a dim, you can't join directly. Example: an order can have multiple coupons applied; a coupon can apply to many orders.

Bridge table: fct_order_coupons with (order_id, coupon_id). The bridge sits between the fact and the dim.

Warning: joining a fact to a bridge to a dim causes fanout. Aggregate first or use grouping aggregations.
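One way to avoid the fanout is to aggregate the bridge before joining (a sketch; coupon_count is illustrative):

```sql
WITH coupons_per_order AS (
  SELECT order_id, COUNT(*) AS coupon_count
  FROM fct_order_coupons
  GROUP BY order_id
)
SELECT
  o.order_id,
  o.revenue,                            -- not inflated by coupon fanout
  COALESCE(c.coupon_count, 0) AS coupon_count
FROM fct_orders o
LEFT JOIN coupons_per_order c ON o.order_id = c.order_id
```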

Degenerate dimensions

A dimension that has no attributes — just an ID. Example: order_id on fct_order_items. You'd never look up attributes about an order_id from a dim table; the ID itself is the dimension.

Keep them in the fact table; don't create a one-column dim_order for them.

Conformed dimensions

The same dim used by multiple fact tables: fct_orders, fct_logins, and fct_support_tickets all join the same dim_users.

The principle: a dim is conformed when its meaning and grain are identical across every use. This is what enables drill-across analysis: "Show me revenue (from orders) and ticket count (from support) by user segment."

Lack of conformed dimensions = "we have three different definitions of customer and the numbers don't tie." Common mid-size company problem.

How to answer "design a data model for X"

The framework that makes you look senior. They'll give you a business — design the warehouse for them.

  1. Ask about the business questions. "Before I model — what are the top three questions the business needs to answer? Revenue by segment? Retention? Unit economics?"
  2. Identify the core entities. Users, products, orders, GPUs, inference requests — whatever's central. These become dimensions.
  3. Identify the core events. What happens in the world that we want to count or measure? Each event becomes a fact table.
  4. State the grain of each fact. "fct_orders is one row per order. fct_inference_requests is one row per inference request."
  5. Identify measures vs attributes. Measures go in facts (numeric, additive where possible). Attributes go in dims (descriptive).
  6. Decide on SCD strategy for each dim. Most are SCD1 (overwrite) by default; SCD2 where point-in-time accuracy matters.
  7. Sketch the diagram. Draw it. Star schema. Foreign keys. Talk through one query path.
  8. Mention edge cases. Late-arriving facts. Backfills. Time zones. Soft deletes. NULL handling.
  9. Discuss the layer strategy. Staging → marts. Where dbt fits.

Worked example: design for an AI inference platform

inference-platform model sketch
FACTS
─────
fct_inference_requests       — one row per API request
  request_id, customer_id, model_id, gpu_id, region_id, date_key,
  ts, latency_ms, input_tokens, output_tokens, cost_usd, status

fct_gpu_hourly_utilization   — one row per GPU per hour
  gpu_id, hour_ts, utilization_pct, memory_used_gb, requests_served,
  revenue_generated_usd, current_customer_id

fct_billing_events           — one row per billing event
  event_id, customer_id, date_key, ts, event_type, amount_usd

DIMENSIONS
──────────
dim_customers (SCD2)         — one row per customer per attribute version
  customer_key, customer_id, plan, region, signup_date, valid_from, valid_to

dim_models                   — one row per model
  model_id, model_name, family, parameter_count, hosted_since

dim_gpus (SCD2)              — one row per GPU per location change
  gpu_key, gpu_id, gpu_type, provider_id, region_id, valid_from, valid_to

dim_providers                — one row per GPU provider
dim_regions                  — one row per region
dim_date                     — one row per calendar day

With this model you can answer:

  • "Revenue per customer per month" — sum cost_usd from fct_inference_requests, point-in-time joined to dim_customers.
  • "GPU utilization by region" — fct_gpu_hourly_utilization joined to dim_gpus (current version) joined to dim_regions.
  • "Provider payout this month" — fct_gpu_hourly_utilization × pricing logic per provider.
  • "Average latency per model" — fct_inference_requests grouped by model_id.

Interview talking points

"How do you start a modeling project?"

"With the business questions, not the source data. I ask the team what the top three to five things they need to answer are, and what 'good' looks like. Then I identify the entities — the dimensions — and the events — the facts. I state the grain of each fact out loud and confirm it with stakeholders before writing SQL. The schema falls out of that conversation, not the other way around. Source data is a constraint; the business question is the goal."

"When do you SCD2 a dimension?"

"Default to SCD1 — overwrite — for most dimensions, because most BI questions ask 'what does the world look like now.' I reach for SCD2 when the business genuinely needs point-in-time accuracy. The clearest case: financial reporting, where you need to know 'what plan was this customer on when they placed this order' — not what plan they're on today. Also: pricing, organizational changes (region/segment), anything regulators might ask about. The cost of SCD2 is more complex joins and more storage; the cost of not SCD2-ing something that needed it is silent wrong numbers in historical reports."

"Star schema or OBT?"

"Star schema as the canonical layer — that's where source-of-truth lives. OBT-style wide tables built on top for specific consumption use cases: self-serve BI, ML feature serving, dashboards where join cost matters. The wide tables are derived from the star; they're not what we test or document — the star is. The hybrid gives you the maintainability of normalized modeling and the query speed of denormalized consumption."