Data & Pipelines — PM Point of View
You don't build the pipelines. You define the event model, the identity resolution, and the data contracts. Get that wrong and the platform looks fine while the analytics layer slowly diverges from reality. This chapter is the PM-side fluency that keeps that from happening.
Why a PM owns the data shape — not "the data team"
In a regulated platform, the event model is the audit log. The audit log is a regulator deliverable. That makes the event model a product artifact, not a data-team byproduct.
- Event schemas are part of the platform contract. Downstream teams build on them.
- Identity resolution decisions (what a "person" is) are product decisions with operational consequences.
- Source-of-truth disputes happen weekly. If you're not refereeing them, someone with less context is.
- Privacy / retention rules are non-negotiable; the platform's data flows must respect them by construction.
You're not writing the DDL. You're writing the spec the DDL implements.
The onboarding event model
An event model is the canonical list of things-that-happen during onboarding, with their schemas. A good one is opinionated, stable, and shared.
Core event types you'd expect:
| Event | When emitted |
|---|---|
| applicant.created | An applicant record is initialized; identity not yet established |
| applicant.identified | Sufficient identity collected to attach to a person_id (verified or pending) |
| verification.session_started | An IDV session is opened with a vendor |
| verification.document_submitted | Customer uploaded a document |
| verification.liveness_completed | Liveness check finished, pass/fail |
| screening.completed | Sanctions / PEP / adverse-media check finished |
| decision.made | The decision engine resolved to approve/refer/reject |
| tier.granted | An applicant moved into a tier |
| tier.upgraded / tier.revoked | Tier change events |
| review.opened / review.closed | Manual-review queue lifecycle |
| rescreening.hit | An existing customer surfaced a new hit |
| policy.changed | A policy document was updated and signed off |
Every event has the same envelope (event_id, type, schema_version, occurred_at, actor, correlation_id, causation_id) plus a typed payload.
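The shared envelope can be sketched as a small value type. This is illustrative, not a production event class; the payload would be a typed schema per event type, and the defaults shown (UUID event IDs, ISO-8601 timestamps) are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
from uuid import uuid4

@dataclass(frozen=True)
class EventEnvelope:
    # The envelope fields listed above; every event type shares these.
    type: str                 # e.g. "verification.liveness_completed"
    payload: dict             # typed per event type in the real system
    actor: str                # service or user that caused the event
    correlation_id: str       # ties together one onboarding journey
    causation_id: Optional[str] = None  # the event that directly caused this one
    schema_version: int = 1
    event_id: str = field(default_factory=lambda: str(uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

evt = EventEnvelope(
    type="verification.liveness_completed",
    payload={"applicant_id": "app_123", "result": "pass"},  # illustrative IDs
    actor="idv-service",
    correlation_id="journey_789",
)
```

The frozen dataclass makes the envelope immutable once emitted, which is the property an audit log needs.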
An event model that changes shape every quarter is unusable for analytics, compliance, and downstream teams. Treat schemas like APIs — additive changes free, breaking changes versioned and migrated.
Funnel instrumentation — get it right once
The funnel SQL you write is only as good as the event stream behind it. Common instrumentation traps:
- Missing events. "id_submitted" only fires on success — so failed submissions are invisible in the funnel and read as abandonment.
- Duplicate events. The same event fires from both client and server, so you count it twice.
- Late events. Mobile clients buffer and send late, skewing time-bucketed analysis.
- Inconsistent identity. Pre-auth events tagged with a session ID, post-auth events with a user ID, no join key.
- Definition drift. "verified" means one thing in March, something else in June, because a code change quietly changed the trigger.
Platform PM remediation:
- Server-emit events from the platform itself, not the client, where possible. Single source of truth.
- Idempotency keys on events; deduplicate at ingestion.
- Document each event's emit condition in code comments AND in a public event catalog.
- Lock event-emit conditions behind code-review with the data team subscribed.
- Use schema validation at ingestion; reject malformed events to a quarantine instead of best-effort ingest.
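Two of those remediations — idempotency keys and quarantine-on-malformed — can be sketched as a minimal ingestion gate. This is an in-memory illustration under stated assumptions (a production system would back the seen-set with a keyed store and validate against a real schema registry):

```python
REQUIRED_FIELDS = {"event_id", "type", "schema_version", "occurred_at"}

def make_ingestor():
    seen = set()               # in production: a keyed store with TTL, not a set
    accepted, quarantine = [], []

    def ingest(event: dict) -> str:
        # Schema validation at the edge: malformed events go to quarantine,
        # never best-effort into the warehouse.
        if not REQUIRED_FIELDS <= event.keys():
            quarantine.append(event)
            return "quarantined"
        # Idempotency: the same event_id is only ever counted once,
        # so client/server double-emits don't double the funnel.
        if event["event_id"] in seen:
            return "duplicate"
        seen.add(event["event_id"])
        accepted.append(event)
        return "accepted"

    return ingest, accepted, quarantine
```

The quarantine is queryable, so a malformed-event spike becomes an alert instead of silently corrupted analytics.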
Identity-event firehose
The firehose is the unified stream of every identity-related event, consumable by any downstream team that has a legitimate need.
Typical consumers:
- Compliance reporting — daily / monthly summaries, regulator queries.
- Fraud / risk — feature engineering for ML models.
- Customer support — show the rep the journey when a customer calls.
- Growth analytics — funnels, cohorts, experiments.
- Operations dashboards — queue health, vendor health.
Implementation usually: Kafka or Kinesis topic → mirrored to the warehouse (Snowflake, BigQuery) for analytic use, with a stream-consumer for real-time uses. The PM doesn't pick the queueing tech; the PM does specify the schema, the retention, the access controls, and the SLA.
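The access-control piece the PM specifies can be sketched as contract-level filtering at the topic boundary. This is a toy in-memory stand-in for a Kafka/Kinesis topic, not real broker ACLs; consumer names and event-type prefixes are illustrative:

```python
from collections import defaultdict

class Firehose:
    """Minimal stand-in for a firehose topic where each consumer only
    receives the event types its data contract grants."""

    def __init__(self, access_policy: dict):
        # access_policy: consumer name -> set of event-type prefixes it may read
        self.access_policy = access_policy
        self.subscribers = defaultdict(list)

    def subscribe(self, consumer: str, handler):
        self.subscribers[consumer].append(handler)

    def publish(self, event: dict):
        for consumer, handlers in self.subscribers.items():
            allowed = self.access_policy.get(consumer, set())
            # Enforce the contract at the platform layer, not in the
            # consumer's own queries.
            if any(event["type"].startswith(p) for p in allowed):
                for h in handlers:
                    h(event)
```

In a real deployment this policy lives in broker ACLs or a stream gateway; the point is that the grant is declared per consumer, centrally.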
user_id vs applicant_id vs person_id — why this matters
Three identifiers serving three different needs. Conflating them is the most common source of analytics bugs in onboarding.
| ID | What it identifies | Lifecycle |
|---|---|---|
| session_id | A single browser/app session | Hours to days |
| applicant_id | A particular onboarding attempt (segment × product × time) | Per-attempt; may have multiple per person |
| user_id / account_id | An application account | Lifetime of the account |
| person_id | The resolved real-world person/entity | Stable; bridges accounts and applicants |
Implications:
- "Number of unique users who completed KYC" is ambiguous — by which id?
- Sanctions screening operates at person_id so one hit triggers across all accounts.
- Resume after abandonment requires linking a new session to a prior applicant_id, which requires either auth or a verified identity match.
- Identity resolution is its own product — false-merge has compliance implications (mixed-up records), false-split has UX implications (the customer re-does KYC unnecessarily).
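The ambiguity in "unique users who completed KYC" is concrete: count by applicant_id and by person_id over the same events and you get two different numbers. A minimal illustration with made-up IDs:

```python
# Each row is one decision.made event, tagged with both the onboarding
# attempt (applicant_id) and the resolved person (person_id).
kyc_completions = [
    {"applicant_id": "app_1", "person_id": "per_A"},
    {"applicant_id": "app_2", "person_id": "per_A"},  # same person, second attempt
    {"applicant_id": "app_3", "person_id": "per_B"},
]

unique_attempts = len({r["applicant_id"] for r in kyc_completions})  # 3
unique_people   = len({r["person_id"]   for r in kyc_completions})   # 2
```

Neither number is wrong; the metric definition has to name the id it counts, which is exactly what the metric registry below this chapter recommends is for.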
Data contracts with growth/marketing
Marketing and growth teams need access to onboarding data — to attribute, to retarget, to measure. But these teams operate under different privacy rules than the platform.
Operational principles:
- Data contract — an explicit doc with the schema, the cadence, the consumer team, the use case, the privacy basis. Versioned.
- Granularity controls — marketing may need cohort-level aggregates, not row-level PII. Default to the minimum.
- Suppression discipline — opt-out, region-locked (EU residents excluded from US-tool flows by default), and consented-purpose are checked at the contract layer, not in downstream queries.
- Audit — log who consumed what and when. If a regulator asks "who saw this customer's data," you have an answer.
A classic failure mode: marketing tags fire on the success page after KYC and send query parameters — including the applicant ID — to a third-party tag manager. Three years later, that's a GDPR finding. Audit the client-side instrumentation, not just the server-side.
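That client-side audit can be partially automated: scan outbound tag URLs for identifier parameters against a PII deny-list. A sketch, assuming the deny-list names below (which you'd replace with your own PII inventory):

```python
from urllib.parse import urlparse, parse_qs

# Parameter names that should never reach a third-party tag.
# Illustrative deny-list; source it from your PII inventory.
PII_PARAMS = {"applicant_id", "person_id", "email", "ssn", "dob"}

def audit_tag_url(url: str) -> set:
    """Return the set of PII parameters a tag-manager URL would leak."""
    params = parse_qs(urlparse(url).query)
    return PII_PARAMS & params.keys()
```

Run it over captured network traffic from the onboarding flow in CI, and the leak becomes a failing test instead of a regulator finding.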
Source-of-truth disputes
"Activation rate was 41% in your dashboard but 38% in mine." This will happen. Your job is to make the source-of-truth question unambiguous before someone presents conflicting numbers to a VP.
Defensive moves:
- One canonical definition per metric, owned by a named person (often you).
- Metric registry — the metric name, its SQL, its owner, its known caveats. Lives in code, not a wiki page.
- Modeled tables, not raw events, for dashboards. The transformation lives in dbt (or similar); BI tools query the modeled view.
- Quarterly metric review — walk through each headline metric with eng + DS + Growth; surface drift.
- "Why are these numbers different?" runbook — a documented process to triage discrepancies before they become political.
Privacy & retention — the part you can't get wrong
You're handling government IDs, biometric data, financial profiles. The privacy regime is non-negotiable.
- Data minimization — collect only what the policy says you need. Extra fields are liability, not optionality.
- Encryption — at rest, in transit, with per-applicant or per-tenant keys where the system supports it.
- Retention — different data classes have different retention. AML data: typically 5-7 years post-account-closure. Marketing data: typically shorter. Biometric template data: regulated separately in some jurisdictions (e.g. Illinois BIPA).
- Erasure — GDPR erasure requests must be honored where AML retention obligations don't override. The platform must know which fields are which.
- Cross-border — EU customer data going to US-hosted services needs an adequate-protection basis (SCCs, etc.).
- Vendor data flows — every IDV vendor receives PII. The data processing agreement governs what they can do with it, where they store it, how they delete it.
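The erasure-vs-retention interaction can be encoded directly: each data class carries a mandatory-hold window, and an erasure request is honorable only once that window has elapsed. The windows below are illustrative placeholders, not legal advice — real values come from policy and counsel, per jurisdiction:

```python
from datetime import date, timedelta

# Mandatory regulatory hold per data class (illustrative values).
MANDATORY_HOLD = {
    "aml_record": timedelta(days=365 * 7),  # e.g. 7 years post-account-closure
    "marketing":  timedelta(days=0),        # no regulatory hold; erasable on request
    "biometric":  timedelta(days=365 * 3),  # jurisdiction-specific (e.g. BIPA)
}

def erasure_allowed(data_class: str, closed_on: date, today: date) -> bool:
    """A GDPR erasure request can be honored only once the mandatory
    retention window for that data class has elapsed."""
    return today >= closed_on + MANDATORY_HOLD[data_class]
```

The platform knowing "which fields are which" means every stored field maps to one of these data classes, so an erasure request resolves mechanically rather than by ad-hoc judgment.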
PM checklist for the data layer
- Have I named the canonical event model and published it?
- Are events server-emitted and idempotent?
- Is the identity hierarchy (session / applicant / user / person) documented with a single resolution authority?
- Are key metrics defined in a registry with named owners?
- Is the firehose retention aligned to the strictest applicable regulatory window?
- Are downstream data contracts explicit, versioned, and audited?
- Are privacy considerations encoded in the contract layer, not the consumer's discipline?
- Can I show a regulator a single applicant's full event chain in < 5 minutes?
If you can say yes to all of these, the data layer is a platform asset. If not, it's a liability with a query interface.