# Data & Pipelines
A PM's-eye view of payments data — events, tokenization, reconciliation, ledgers. You don't build this, but you must understand it well enough to spec instrumentation and to read what comes out the other end.
## The payments event model
Payments are best modeled as an append-only event stream around a stable identity — the payment intent. Common events per intent:
| Event | When emitted |
|---|---|
| payment_intent.created | User initiates a deposit |
| payment_intent.requires_action | 3DS challenge required |
| auth.attempted | Request sent to PSP |
| auth.approved / auth.declined | PSP responds |
| auth.cascaded | Failover to an alternate PSP |
| capture.requested / capture.completed | If auth and capture are split |
| settlement.scheduled / settlement.completed | Funds moved by the acquirer |
| refund.requested / refund.completed | Refund flow |
| dispute.opened / dispute.evidence_submitted / dispute.resolved | Chargeback lifecycle |
| ledger.posted | Internal ledger entry written |
Every event carries the stable payment_intent_id plus a monotonically increasing sequence number. Downstream consumers reconstruct state by replaying events in order.
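The replay idea can be sketched in a few lines. This is a minimal illustration, not a real consumer: the tuple event shape and the event-to-state mapping below are hypothetical simplifications.

```python
# Hypothetical event shape: (payment_intent_id, sequence, event_type).
# Webhooks may arrive out of order, so sequence drives the replay.
EVENTS = [
    ("pi_123", 3, "capture.completed"),
    ("pi_123", 1, "payment_intent.created"),
    ("pi_123", 2, "auth.approved"),
]

# Simplified mapping from event type to the intent state it implies.
STATE_BY_EVENT = {
    "payment_intent.created": "created",
    "auth.approved": "authorized",
    "auth.declined": "declined",
    "capture.completed": "captured",
    "settlement.completed": "settled",
}

def replay(events):
    """Reconstruct current state per intent by replaying in sequence order."""
    state = {}
    for intent_id, seq, event_type in sorted(events, key=lambda e: (e[0], e[1])):
        if event_type in STATE_BY_EVENT:
            state[intent_id] = STATE_BY_EVENT[event_type]
    return state

print(replay(EVENTS))  # {'pi_123': 'captured'}
```

Because state is derived, a consumer can always be rebuilt from scratch by replaying the stream from sequence 1.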
## Tokenization — where raw card data ends and tokens begin
From a data-pipeline perspective:
- Raw PAN enters at the edge (iframe / SDK / hosted-fields).
- Vault receives it; returns a vault token (internal).
- Vault token may map to a scheme-issued network token (VTS/MDES) used downstream.
- Everything after the vault uses tokens, never PAN. PCI scope is limited to the vault and its callers.
- Analytics data should never contain PAN. The BIN (first 6-8 digits) is fine; the full PAN never lands in your warehouse.
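What the analytics-safe projection looks like can be sketched as follows. The function name and field names are illustrative; the point is that only the token, BIN, and last four digits ever leave the vault boundary.

```python
def analytics_view(pan: str, vault_token: str) -> dict:
    """Derive an analytics-safe record: token + BIN + last4, never the PAN.
    In practice this runs inside the PCI-scoped vault; downstream systems
    only ever see the safe projection it emits."""
    return {
        "card_token": vault_token,  # stable internal identifier
        "bin": pan[:6],             # first 6 digits; some schemes now use 8
        "last4": pan[-4:],
    }

record = analytics_view("4242424242424242", "tok_abc")
print(record)  # {'card_token': 'tok_abc', 'bin': '424242', 'last4': '4242'}
```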
## Auth and settlement events — they live in different streams
Two distinct sources:
- Authorization stream — synchronous, from PSP API responses. Real-time, per attempt.
- Settlement stream — asynchronous, from PSP daily files / SFTP / API pulls. Per batch.
These do not arrive at the same time: authorization is realtime, while settlement files often land on the morning of T+1. Reconciliation joins the two streams on intent ID and amount; mismatches signal problems such as missing settlements, amount deltas, or duplicates.
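The join itself is simple to sketch. Field names and record shapes here are hypothetical; the output is the three buckets a recon job cares about.

```python
def reconcile(auth_events, settlement_rows):
    """Join settlement rows to auth events on intent_id, compare amounts,
    and return the three recon buckets: matched, mismatched, missing."""
    auth_by_id = {a["intent_id"]: a for a in auth_events}
    matched, mismatched = [], []
    for row in settlement_rows:
        auth = auth_by_id.pop(row["intent_id"], None)
        if auth is None:
            mismatched.append(("settled_without_auth", row["intent_id"]))
        elif auth["amount"] == row["amount"]:
            matched.append(row["intent_id"])
        else:
            mismatched.append(("amount_delta", row["intent_id"]))
    # Authed but not yet settled -- often just T+1 lag, not an error.
    missing = list(auth_by_id)
    return matched, mismatched, missing

auths = [{"intent_id": "pi_1", "amount": 1000}, {"intent_id": "pi_2", "amount": 500}]
settlements = [{"intent_id": "pi_1", "amount": 1000}]
print(reconcile(auths, settlements))  # (['pi_1'], [], ['pi_2'])
```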
## Sources of truth — there are several
| Source | What it knows | Latency |
|---|---|---|
| PSP API responses | Auth attempt outcome | Realtime |
| PSP settlement report | Funds actually moved | T+1 typical |
| Acquirer settlement file | Funds reaching the acquirer | T+1 |
| Bank statement | Funds in your nostro | T+1 / EOD |
| Internal ledger | Customer balance state | Realtime (post-event) |
| Card scheme reports | Dispute / chargeback events | Days-weeks |
When sources disagree, your job is to define which wins for which question:
- "Was the customer credited?" → internal ledger.
- "Did the acquirer get paid?" → settlement report.
- "Did the funds hit our bank?" → bank statement.
- "Is the customer about to chargeback?" → scheme report.
## Daily reconciliation cycle
Every morning, the recon job runs:
- Pull yesterday's PSP settlement files (SFTP, API, or via aggregator).
- Parse and normalize into the canonical settlement event schema.
- Join to internal auth events on intent_id + amount.
- Compute three buckets: matched, mismatched, missing.
- Surface to finance / treasury for human review.
- Auto-resolve patterns we know (FX rounding deltas, fee adjustments).
- Open tickets for genuine mismatches.
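The auto-resolve step above is often a simple tolerance rule. A sketch, with an illustrative threshold (the real tolerance is a policy decision made with finance):

```python
def classify_delta(expected_minor: int, settled_minor: int,
                   fx_tolerance_minor: int = 2) -> str:
    """Classify an amount mismatch: tiny deltas auto-resolve as FX rounding,
    anything larger goes to the manual queue. Amounts are in minor units."""
    delta = abs(expected_minor - settled_minor)
    if delta == 0:
        return "matched"
    if delta <= fx_tolerance_minor:
        return "auto_resolved_fx_rounding"
    return "manual_review"

print(classify_delta(10_000, 10_001))  # auto_resolved_fx_rounding
```

The PM lever is precisely this function's threshold and rule set: every pattern you can safely auto-resolve shrinks the manual queue.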
PM lever: tighten the matching rules; reduce the manual queue; report weekly on "% of settlement auto-matched."
## Cross-border reconciliation — the harder version
Same flow, plus:
- FX conversion at one or more legs — your quoted rate vs PSP's actual rate.
- Currency-of-record at multiple ledgers — customer-currency, settlement-currency, base-currency.
- Timezone misalignment — your "yesterday" and the partner's "yesterday" may differ.
- Multiple correspondent banks for SWIFT corridors — each takes a cut.
Senior pattern: keep a shadow ledger in your base currency. Reconcile actual vs. expected daily and route deltas to an "FX P&L" account.
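A toy sketch of routing the delta. Account names and the entry shape are made up for illustration; real double-entry postings would be richer.

```python
def fx_delta_to_pnl(expected_base_minor: int, actual_base_minor: int) -> dict:
    """Compare expected vs. actual settlement in base-currency minor units;
    absorb any difference into a dedicated FX P&L account."""
    delta = actual_base_minor - expected_base_minor
    entries = [("settlement_clearing", actual_base_minor)]
    if delta:
        entries.append(("fx_pnl", -delta))  # delta absorbed by FX P&L
    return {"delta_minor": delta, "entries": entries}

# Quoted 10.00 in base units, PSP actually settled 9.97:
print(fx_delta_to_pnl(1000, 997))
```

The value of the pattern is that the core customer ledger never carries FX noise; all of it accumulates in one account you can monitor and price against.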
## Why payments data is messy
Tell-tale signs you should be prepared to discuss:
- Schema drift — each PSP has its own fields, casing, decline-code dictionary. Normalization is constant.
- Late events — settlements arrive T+N. Disputes arrive weeks later. Your "today's numbers" mutate.
- Idempotency artifacts — same event delivered twice; consumers must dedupe.
- Out-of-order events — webhook delivery is not ordered; you can see captured before authorized.
- Currency rounding — minor units differ (JPY has no decimals; BHD has 3).
- Partial refunds and disputes — multi-row events against one txn.
- Versioning — your decline-family taxonomy will evolve; analytics must handle historical breaks.
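The idempotency and ordering problems above share one standard fix: dedupe on (intent, sequence) and sort by sequence before consuming. A minimal sketch with hypothetical field names:

```python
def dedupe_and_order(events):
    """Drop duplicate deliveries (same intent + sequence) and emit events in
    per-intent sequence order, even if webhooks arrived out of order."""
    seen = set()
    unique = []
    for e in events:
        key = (e["intent_id"], e["sequence"])
        if key not in seen:  # idempotency: second delivery is ignored
            seen.add(key)
            unique.append(e)
    return sorted(unique, key=lambda e: (e["intent_id"], e["sequence"]))

raw = [
    {"intent_id": "pi_1", "sequence": 2, "type": "capture.completed"},
    {"intent_id": "pi_1", "sequence": 1, "type": "auth.approved"},
    {"intent_id": "pi_1", "sequence": 2, "type": "capture.completed"},  # duplicate
]
print([e["type"] for e in dedupe_and_order(raw)])
# ['auth.approved', 'capture.completed']
```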
## Warehouse vs operational stores
Two different data surfaces:
- Operational — Postgres / Cassandra-like. Realtime; low latency; supports the auth path. Per-record reads.
- Warehouse — Snowflake / BigQuery / Redshift. Batch loaded from operational + event stream. Analytics queries.
For a PM, most of your analysis lives in the warehouse: cohort cuts, AAR investigations, vendor scorecards. The operational store serves the real-time path. Don't confuse them in an interview.
## What a PM asks of the data pipeline
- Every auth attempt logged with the canonical field set (see 07).
- One stable identity (payment_intent_id) across attempts, captures, refunds, disputes.
- Late-arriving events handled — yesterday's numbers can update, but with an audit trail.
- Settlement events joined to auth events automatically; mismatches alerted.
- Dispute events linked to the underlying intent for representment context.
- Warehouse refresh SLA — analytics should be at most a few hours stale.
- PII handled — no PAN in warehouse; BIN OK; tokenized identifiers only.
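The "no PAN in warehouse" requirement is commonly enforced with a pre-load scan: flag any field containing a card-length digit run that passes the Luhn check. A sketch (regex and function names are illustrative; the Luhn algorithm itself is the standard card check-digit scheme):

```python
import re

PAN_LIKE = re.compile(r"\b\d{13,19}\b")  # card numbers are 13-19 digits

def luhn_valid(digits: str) -> bool:
    """Standard Luhn check-digit validation used by card numbers."""
    total, double = 0, False
    for ch in reversed(digits):
        d = int(ch)
        if double:
            d *= 2
            if d > 9:
                d -= 9
        total += d
        double = not double
    return total % 10 == 0

def scan_row(row: dict) -> list:
    """Return the names of fields whose values look like a PAN."""
    hits = []
    for key, value in row.items():
        for candidate in PAN_LIKE.findall(str(value)):
            if luhn_valid(candidate):
                hits.append(key)
    return hits

print(scan_row({"card_token": "tok_abc", "memo": "PAN 4242424242424242"}))
# ['memo']
```

BIN (6-8 digits) and last4 fall below the 13-digit floor, so legitimate analytics fields pass the scan untouched.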