Defining the Data Stack
The "lay the foundation" prompt, answered. How a founding DS thinks about the warehouse, event tracking, transformation, experimentation, and BI layers — and where to start when nothing exists yet.
The greenfield prompt
A common founding-DS interview prompt: "We don't have a data team yet. What's the first thing you'd build, and what's the stack at the end of year one?" It's a layered design exercise, and you'll be judged on:
- Whether you can name the right layers without missing one (a common trap is forgetting the experimentation platform).
- Whether you pick tools that fit the company stage (no Looker at a 10-person company; no DIY at a 500-person one).
- Whether you sequence the build sensibly (you cannot ship dashboards before the warehouse is real).
- Whether you can defend "the boring answer" — most early-stage stacks should be boring, and a senior DS knows when to resist exotic choices.
The five layers
Modern product-analytics stacks have five layers. Memorize this order; in interviews, walking up from raw events to served metrics signals you've thought through the whole thing:
- Event ingestion — how user actions get from the product into a queue/store. Segment, Rudderstack, Snowplow, or homegrown.
- Warehouse — where all data lands and is queryable. BigQuery, Snowflake, Redshift, Databricks.
- Transformation — how raw events become clean models. dbt, dataform, Coalesce, SQL files.
- Experimentation platform — assignment, exposure logging, analysis. Statsig, Eppo, Optimizely, GrowthBook, or a self-build for simple cases.
- BI / serving — how non-technical users consume data. Looker, Mode, Hex, Metabase, plus piped feeds into Salesforce / Slack / email.
Event ingestion
The decisions:
- Buy vs build: at <100 events/sec, buying (Segment, Rudderstack) is right. At >1k events/sec or for cost reasons, build on top of Kafka/Kinesis.
- Server-side vs client-side: client-side is easier; server-side is more reliable (ad blockers, network errors, mobile background kills). Best practice is dual — client for UX events, server for revenue/auth events you can't afford to lose.
- Schema discipline: agree on an event taxonomy before the engineers start emitting events. Otherwise you'll spend year two cleaning up.
A minimal event taxonomy
Every event should carry: event_id (uuid), event_name, event_at (server timestamp), user_id (when known), anonymous_id, session_id, app_version, platform, properties (JSON blob with event-specific fields). The naming convention should be noun_verb_past_tense: video_published, plan_upgraded, signup_completed.
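To make the taxonomy concrete, here is a minimal sketch of a validator for events shaped like the payload above. The field names come from the taxonomy; the specific checks (UUID format, snake_case name pattern) are illustrative, not a real Segment/Rudderstack schema:

```python
# Validator for the minimal event taxonomy. Field names match the taxonomy
# above; the checks themselves are illustrative assumptions.
import re
import uuid
from datetime import datetime, timezone

REQUIRED_FIELDS = {"event_id", "event_name", "event_at", "anonymous_id",
                   "session_id", "app_version", "platform", "properties"}
# noun_verb_past_tense in snake_case: at least two lowercase words.
NAME_PATTERN = re.compile(r"^[a-z]+(_[a-z]+)+$")

def validate_event(event: dict) -> list[str]:
    """Return a list of problems; an empty list means the event is well-formed."""
    problems = []
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "event_id" in event:
        try:
            uuid.UUID(str(event["event_id"]))
        except ValueError:
            problems.append("event_id is not a valid UUID")
    if "event_name" in event and not NAME_PATTERN.match(str(event["event_name"])):
        problems.append("event_name is not snake_case noun_verb_past_tense")
    if "properties" in event and not isinstance(event["properties"], dict):
        problems.append("properties must be a JSON object")
    return problems

good = {
    "event_id": str(uuid.uuid4()),
    "event_name": "plan_upgraded",
    "event_at": datetime.now(timezone.utc).isoformat(),
    "user_id": "u_123",
    "anonymous_id": "anon_9",
    "session_id": "s_42",
    "app_version": "1.8.0",
    "platform": "web",
    "properties": {"plan": "pro"},
}
```

Running this kind of check at the ingestion boundary (or as a CI gate on new event definitions) is what keeps the taxonomy from eroding as engineers add events.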
Warehouse
At Series B–C AI SaaS scale, the right answer is BigQuery or Snowflake. Choose by:
- BigQuery: cheaper for spiky workloads, tighter integration with Google Cloud, scales without thought.
- Snowflake: cleaner SQL, separation of storage and compute, finer cost-attribution. Better for many warehouses across many teams.
Avoid Postgres-as-warehouse past ~100GB of analytics data — it becomes slow and brittle. Avoid Redshift unless you've inherited it (the operational overhead is higher than BigQuery/Snowflake).
Schema layering
Standard three layers:
- Raw: events as they land, untransformed.
- Staging: cleaned, typed, one-to-one with raw.
- Marts: business-friendly models (users, subscriptions, sessions, attribution).
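The three layers can be sketched end to end with SQLite views standing in for dbt models. The table and column names here are invented for illustration; in a real stack each `CREATE VIEW` would be its own dbt model file:

```python
# Toy illustration of raw -> staging -> marts layering using SQLite views.
# In practice each layer is a dbt model; names here are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Raw: events exactly as they landed, untransformed (duplicates and all).
CREATE TABLE raw_events (
    event_id TEXT, event_name TEXT, event_at TEXT, user_id TEXT
);
INSERT INTO raw_events VALUES
    ('e1', 'signup_completed', '2024-01-01T10:00:00Z', 'u1'),
    ('e1', 'signup_completed', '2024-01-01T10:00:00Z', 'u1'),
    ('e2', 'plan_upgraded',    '2024-01-02T09:30:00Z', 'u1'),
    ('e3', 'signup_completed', '2024-01-03T11:00:00Z', 'u2');

-- Staging: cleaned and deduplicated, one-to-one with raw.
CREATE VIEW stg_events AS
SELECT DISTINCT event_id, event_name, event_at, user_id FROM raw_events;

-- Mart: business-friendly, one row per user.
CREATE VIEW mart_users AS
SELECT user_id,
       MIN(event_at) AS first_seen_at,
       COUNT(*)      AS event_count
FROM stg_events
GROUP BY user_id;
""")

rows = con.execute("SELECT * FROM mart_users ORDER BY user_id").fetchall()
# -> [('u1', '2024-01-01T10:00:00Z', 2), ('u2', '2024-01-03T11:00:00Z', 1)]
```

The point of the layering is visible even in the toy: the duplicate in raw never reaches the mart, and the mart's grain (one row per user) is what BI consumers actually want.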
Transformation
Use dbt unless you have a strong reason not to. It's the industry standard, the talent pool knows it, and it produces auditable lineage.
What dbt buys you
- SQL-first transformations with Jinja templating.
- Tests on schemas and data (uniqueness, not-null, accepted-values, custom).
- Lineage graph — useful when stakeholders ask "where did this number come from."
- CI on every PR.
What it doesn't
- Orchestration of non-SQL work (Airflow, Dagster, or Prefect fill that gap).
- Stream processing.
- Anything Python-heavy.
Experimentation platform
The most undervalued layer. Without it, every A/B test is a one-off spreadsheet — which means slow, inconsistent, and prone to error. The platform is what scales experimentation from "the DS team runs one a quarter" to "PMs run 5 a week."
Tradeoffs
- Statsig / Eppo: managed, good UX, fast time-to-value, but locked in.
- GrowthBook: open-source, self-hostable, less polished.
- Build it yourself: assignment via a hash on user_id, exposure logging through your event stream, analysis in dbt + a notebook. Cheapest up front, but you'll spend more on operations than you saved on licenses.
For a company at Series B–C, buy the managed platform. The DS team should be doing analysis, not building infra. Build-it-yourself makes sense at Series D+ when scale demands custom and you have the headcount to operate.
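The "hash on user_id" assignment mentioned above fits in a few lines, which is partly why the self-build is tempting. A sketch (the experiment name and 50/50 split are illustrative; the hard part a platform buys you is everything around this function, not the function itself):

```python
# Deterministic experiment assignment via a hash on (experiment, user_id).
# The experiment name and two-arm split are illustrative assumptions.
import hashlib

def assign(user_id: str, experiment: str,
           variants=("control", "treatment")) -> str:
    """Hash experiment:user_id so assignment is stable per user, per experiment,
    and independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]
```

Salting the hash with the experiment name is what keeps assignments independent across experiments; forgetting it is a classic self-build bug that correlates every test with every other.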
BI / serving
The serving layer is what non-technical users actually see. Three tiers:
- Self-serve dashboards: Looker, Mode, Hex, Metabase. Pick the one matching your team's SQL fluency and budget.
- Embedded metrics in workflow tools: revenue per account in Salesforce; weekly engagement summaries in Slack; alerting on metric anomalies.
- Decision artifacts: one-pagers, monthly reviews, board decks. The work products that close decision loops.
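The "alerting on metric anomalies" piece of the workflow tier can start very simple. A minimal sketch: flag today's value if it sits more than k standard deviations from the trailing window (the threshold and the metric history are made up):

```python
# Minimal metric-anomaly check for a Slack-style alert feed.
# Threshold k and the example DAU history are illustrative assumptions.
from statistics import mean, stdev

def is_anomalous(history: list[float], today: float, k: float = 3.0) -> bool:
    """True if `today` is more than k sigma from the trailing mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) > k * sigma

dau = [1020, 980, 1005, 995, 1010, 990, 1000]  # last 7 days
```

A z-score on a trailing window ignores seasonality (weekday vs weekend), so in practice you'd compare against the same weekday or use a seasonal baseline, but this is enough to catch a broken event pipeline the day it breaks.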
Phased rollout
The answer to "what do you build first" is a sequence, not a stack diagram:
| Phase | What ships | Why this order |
|---|---|---|
| Month 1 | Event taxonomy + Segment-style ingestion + raw lands in BigQuery | Without events you have nothing. Get the data flowing first. |
| Month 2 | dbt repo with staging + first three marts (users, sessions, subscriptions) | Marts unblock analysts and the BI tool. |
| Month 3 | BI tool (Mode or Hex), first 5 dashboards, weekly review process | Stakeholders need a place to find numbers. |
| Month 4 | Managed experimentation platform (Statsig or Eppo) wired up | You can't ship the "we run rigorous A/B tests" promise without infra. |
| Month 5–6 | Metric layer (LookML, Cube, dbt semantic models); first propensity / forecasting models | Now you can prevent metric definitions from drifting across dashboards. |
A senior DS adds: "I'd verify the engineering team can actually emit the events the taxonomy assumes — that's the most common reason month 1 slips to month 2. I'd also socialize the metric definitions early, even before the warehouse is real, so we don't ship dashboards with definitions nobody agreed to."
Interview probes
Probe 1: "We have nothing. Where do you start?"
"Event taxonomy and ingestion in month one. Without events, every downstream investment is on quicksand. I'd agree the taxonomy with PM and eng leads, pick Segment or Rudderstack for ingestion (we can build later if cost or scale demand it), and land everything raw in BigQuery or Snowflake. In parallel I'd start dbt with staging models so as soon as raw is flowing, we have clean marts. The first dashboard ships in month three."
Probe 2: "Would you buy or build the experimentation platform?"
"At Series B–C, buy. Statsig or Eppo. The DS team's time is more valuable on analysis than on infrastructure. Build starts to make sense at Series D+ when scale justifies the headcount, or when you have specialized requirements (network-cluster randomization, switchback experiments, real-time bandit serving) that the managed platforms don't cover well."
Probe 3: "What's the most common failure mode of an early data stack?"
"Metric definitions drift across dashboards because nobody owns them. The fix is a metric layer — dbt semantic models, LookML, or Cube — that defines metrics in one place and serves them to every consumer. Without it, you spend year two reconciling 'why does this dashboard say 17% and that one say 19%' instead of building new things."
Probe 4: "When do you NOT use dbt?"
"When the transformation is fundamentally non-SQL — heavy Python feature engineering, streaming joins, ML feature stores. For SQL transformation over a warehouse, dbt is the right answer 95% of the time. The remaining 5% is when you've reached the scale where compiled SQL is too slow and you need a custom orchestration layer with parallelism dbt doesn't manage well — at which point you've outgrown dbt anyway."