Section D · Production

Launch & Ops

From signed partner contract to first successful production payment, then through hypercare to BAU (business as usual). The launch playbook for a new rail in a new market, and the awkward truth that you can't really "roll back" a settled payment.

The new-rail launch playbook

The interview answer when asked "How would you launch UPI in India?":

Partner selected & contracted — RFP done, MSA + SLA + DPA in place, kickoff scheduled.
Compliance pre-review: sanctions regime, licensing posture, and AML obligations confirmed.
Integration in sandbox — engineering builds the integration; runs partner test suite.
End-to-end test plan — happy paths, soft decline, hard decline, timeout, 3DS, refund, dispute.
Compliance signoff — formal review of flows, monitoring rules, data flows.
Limited production canary — internal employees only; real money, real rails.
Whitelist rollout — a tiny external cohort.
Percentage rollout: 1%, 5%, 25%, 50%, 100%, with health gates between stages.
Hypercare: 2-4 weeks of elevated monitoring, daily standups, and an on-call partnership with engineering.
BAU handoff — sustaining mode; metrics in dashboard; weekly review cadence.
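
The stages are strictly ordered, and each one advances only on a recorded exit decision. A minimal Python sketch of that rule follows; the `Stage` names and the `exit_criteria_met` map are hypothetical and not taken from any real launch tool.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical encoding of the ten playbook stages, in launch order."""
    PARTNER_CONTRACTED = 1
    COMPLIANCE_PRE_REVIEW = 2
    SANDBOX_INTEGRATION = 3
    E2E_TEST_PLAN = 4
    COMPLIANCE_SIGNOFF = 5
    PRODUCTION_CANARY = 6
    WHITELIST_ROLLOUT = 7
    PERCENT_ROLLOUT = 8
    HYPERCARE = 9
    BAU = 10


def advance(current: Stage, exit_criteria_met: dict) -> Stage:
    """Return the next stage only if the current stage's exit criteria are recorded as met."""
    if current is Stage.BAU:
        return current  # terminal: sustaining mode
    if not exit_criteria_met.get(current, False):
        return current  # hold: no skipping stages and no implicit progress
    return Stage(current + 1)
```

For example, `advance(Stage.COMPLIANCE_SIGNOFF, {})` stays at `COMPLIANCE_SIGNOFF`: without an explicit signoff record, the canary never starts.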

Partner onboarding

Often the slowest stage, easily 6-12 weeks.

KYB (know-your-business) diligence on the partner, and theirs on you.
Legal: MSA, DPA (data-processing agreement), sub-processor list, IP, indemnity.
Information-security review: SOC 2 / ISO 27001 evidence, penetration-test results, incident response, sub-processor controls.
Sanctions / FinCrime alignment, including Travel Rule obligations specifically for crypto rails.
Technical onboarding: credentials, sandbox access, webhook endpoints, IP allowlists.
Roles: who's the partner's account director, technical contact, compliance contact, on-call?
Cadence: monthly business review template agreed in writing.
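
The IP allowlist agreed during technical onboarding is usually enforced at the network edge, but the check itself is simple. A sketch using Python's standard `ipaddress` module; the CIDR ranges are RFC 5737 documentation ranges standing in for whatever the partner actually publishes. An allowlist is normally one layer alongside webhook signature verification, not a replacement for it.

```python
import ipaddress

# Hypothetical partner webhook ranges (RFC 5737 documentation blocks, not real partner IPs).
PARTNER_WEBHOOK_CIDRS = [
    ipaddress.ip_network("192.0.2.0/28"),
    ipaddress.ip_network("198.51.100.16/29"),
]


def is_allowed_source(remote_addr: str) -> bool:
    """Return True if an inbound webhook came from an allowlisted partner range."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # malformed address: reject rather than guess
    # Membership across IP versions (IPv6 address vs IPv4 network) is simply False.
    return any(addr in net for net in PARTNER_WEBHOOK_CIDRS)
```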

Sandbox testing

What the test plan covers — the interview-grade list:

Happy-path auth + capture + settlement.
Soft decline → cascade.
Hard decline → no retry; UX message.
Timeout / network failure → idempotent retry.
3DS frictionless and challenge.
Partial refund; full refund; double refund attempt (idempotency).
Dispute open / evidence submit / win / lose.
Recurring mandate creation + execution + cancellation.
Rail-specific: PIX MED return; UPI mandate failure; SEPA recall.
Webhook replay; out-of-order delivery.
Settlement file ingestion (real format, edge cases).
Load, soak, and chaos testing: degrade the partner, then verify recovery.
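
Two of the cases above, the double refund attempt and webhook replay / out-of-order delivery, come down to idempotency and ordering. The sketch below is hypothetical: `FakeRefundService` stands in for a partner sandbox, and `WebhookConsumer` assumes each event carries a unique event ID and a per-payment sequence number, which not every partner provides (some only send timestamps).

```python
from dataclasses import dataclass, field


@dataclass
class FakeRefundService:
    """Hypothetical sandbox refund endpoint that honours an idempotency key."""
    captured_minor: int                          # captured amount in minor units
    refunded_minor: int = 0
    results: dict = field(default_factory=dict)  # idempotency key -> amount refunded

    def refund(self, key: str, amount_minor: int) -> int:
        if key in self.results:
            if self.results[key] != amount_minor:
                raise ValueError("idempotency key reused with a different amount")
            return self.results[key]  # retry after timeout: same result, no new money moves
        if self.refunded_minor + amount_minor > self.captured_minor:
            raise ValueError("refund exceeds captured amount")
        self.refunded_minor += amount_minor
        self.results[key] = amount_minor
        return amount_minor


@dataclass
class WebhookConsumer:
    """Applies each partner event at most once and ignores stale status updates."""
    seen_event_ids: set = field(default_factory=set)
    latest_seq: dict = field(default_factory=dict)  # payment id -> highest sequence applied
    status: dict = field(default_factory=dict)      # payment id -> current status

    def handle(self, event_id: str, payment_id: str, seq: int, new_status: str) -> bool:
        """Return True if the event changed state, False if it was a replay or arrived late."""
        if event_id in self.seen_event_ids:
            return False  # replayed delivery
        self.seen_event_ids.add(event_id)
        if seq <= self.latest_seq.get(payment_id, -1):
            return False  # out of order: a newer event was already applied
        self.latest_seq[payment_id] = seq
        self.status[payment_id] = new_status
        return True
```

A double refund attempt should exercise exactly the retry path: calling `refund("r1", 400)` twice returns 400 both times while `refunded_minor` stays at 400.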

Compliance signoff — what they actually review

Data flow diagram — where data goes, who processes it, where it's stored, retention.
Sanctions screening points — is every counterparty checked?
AML rule integration — does this new rail emit events into transaction monitoring?
Travel Rule application — for crypto-touching flows.
Customer disclosures — fee, FX, risk warnings as required.
Records retention — do logs survive the required window?
Incident response — who is paged, what is reported, what's the regulator notification trigger?
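
Two of these questions, screening coverage and monitoring integration, can be partly checked mechanically before the human review. A hypothetical sketch: the `Step` fields are invented for illustration and assume the flow is described as data. It supplements the compliance review and does not replace it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One step of a payment flow, described as data (hypothetical schema)."""
    name: str
    counterparty: Optional[str] = None  # third party paid or paid from at this step, if any
    screened: bool = False              # sanctions screening runs before funds move
    emits_tm_event: bool = False        # event published to transaction monitoring


def compliance_gaps(flow: List[Step]) -> List[str]:
    """List the gaps a reviewer would flag: unscreened counterparties, unmonitored movements."""
    gaps = []
    for step in flow:
        if step.counterparty is None:
            continue  # no third-party money movement at this step
        if not step.screened:
            gaps.append(f"{step.name}: counterparty {step.counterparty} not sanctions-screened")
        if not step.emits_tm_event:
            gaps.append(f"{step.name}: no transaction-monitoring event")
    return gaps
```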

Compliance signoff is not a rubber stamp. Treat your compliance partner as part of the launch team from week one.

Staged rollout — the percentage gates

Each gate has a health checklist that must pass before the rollout progresses:

Gate	Population	Health checks
Canary	Internal employees (10s)	End-to-end works at all
Whitelist	Invited cohort (100s)	AAR ≥ target floor; no edge-case incidents
1%	Random sample (1000s)	AAR steady; cost-per-success in band; no support spike
5% / 25% / 50% / 100%	Progressively wider	Each gate held for ≥ 24-72h; same checks, tighter thresholds
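
For the percentage gates, a common approach is deterministic bucketing: hash a stable customer ID so that everyone inside the 5% cohort is still inside at 25%, and no one flips between rails as the dial moves. A sketch under that assumption; the salt and the 10,000-bucket granularity are arbitrary choices, and the canary and whitelist stages would use explicit ID lists instead.

```python
import hashlib

BUCKETS = 10_000  # 0.01% granularity


def rollout_bucket(customer_id: str, salt: str = "new-rail-launch") -> int:
    """Map a customer to a stable bucket in [0, BUCKETS); same input, same bucket, every time."""
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_rollout(customer_id: str, percent: float, salt: str = "new-rail-launch") -> bool:
    """True if the customer is inside the current rollout percentage (0 to 100)."""
    return rollout_bucket(customer_id, salt) < percent * BUCKETS / 100
```

Changing the salt reshuffles every cohort, so it should stay fixed for the life of one launch.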

Gate decisions are explicit: who calls "go," who calls "hold," what data we look at, what time of day the call is made.
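
A go/hold check can be encoded so the data behind the call is the same every time, while a named owner still makes the decision. The threshold values and field names below are hypothetical placeholders, not recommended numbers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GateThresholds:
    """Hypothetical per-gate thresholds; each launch sets its own values."""
    min_aar: float               # approval-rate floor
    max_cost_per_success: float  # upper edge of the acceptable cost band
    max_contacts_per_1k: float   # support contacts per 1k attempts
    min_hold_hours: float        # e.g. somewhere in the 24-72h range


@dataclass(frozen=True)
class GateMetrics:
    aar: float
    cost_per_success: float
    contacts_per_1k: float
    hours_at_stage: float


def gate_decision(m: GateMetrics, t: GateThresholds) -> Tuple[str, List[str]]:
    """Return ("go", []) or ("hold", reasons); a named owner still makes the final call."""
    reasons = []
    if m.hours_at_stage < t.min_hold_hours:
        reasons.append(f"held {m.hours_at_stage}h, minimum is {t.min_hold_hours}h")
    if m.aar < t.min_aar:
        reasons.append(f"AAR {m.aar:.3f} below floor {t.min_aar:.3f}")
    if m.cost_per_success > t.max_cost_per_success:
        reasons.append("cost per success above band")
    if m.contacts_per_1k > t.max_contacts_per_1k:
        reasons.append("support contact rate above limit")
    return ("hold", reasons) if reasons else ("go", [])
```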

Launch-day dashboard

Build it before launch. Don't watch raw logs. Top-line panels:

Volume — intents/min on the new rail.
AAR — first-attempt and final, rolling 15-minute window.
Latency — p50/p95/p99 auth response.
Errors — partner-side error code distribution.
Support — contact rate per 1k attempts.
Fraud signals — early-warning score.
Compare panel — new rail vs incumbent rail in same geo, same hour.
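
The AAR panel's rolling 15-minute window can be computed with a time-ordered queue. A sketch, assuming attempts arrive in timestamp order; the compare panel would run one instance per rail (new and incumbent), and first-attempt and final AAR would each get their own instance.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Deque, Optional, Tuple


class RollingAAR:
    """Approval rate over a sliding time window (15 minutes by default)."""

    def __init__(self, window: timedelta = timedelta(minutes=15)) -> None:
        self.window = window
        self.events: Deque[Tuple[datetime, bool]] = deque()
        self.approved = 0

    def record(self, ts: datetime, approved: bool) -> None:
        """Add one attempt; assumes attempts arrive in timestamp order."""
        self.events.append((ts, approved))
        self.approved += int(approved)
        self._evict(ts)

    def _evict(self, now: datetime) -> None:
        # Drop attempts that have fallen out of the (now - window, now] interval.
        cutoff = now - self.window
        while self.events and self.events[0][0] <= cutoff:
            _, ok = self.events.popleft()
            self.approved -= int(ok)

    def rate(self, now: datetime) -> Optional[float]:
        """Approved / attempts in the window, or None when there were no attempts."""
        self._evict(now)
        return self.approved / len(self.events) if self.events else None
```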

Hypercare

2-4 weeks during which the operating bar is raised: daily standup with engineering, payments-ops on-call, the partner on a shared channel, and a weekly business review with the partner. Exit hypercare against a written checklist, not because things "feel stable."

On-call partnership with payments engineering

The PM is rarely the engineering on-call, but you are the customer's on-call: communicating with leadership, support, and the partner. Your job during an incident:

Run the comms — status page, internal Slack, exec briefing.
Make the call on customer-facing UX during degradation (e.g. should we hide the rail temporarily?).
Coordinate with the partner's incident channel.
Track post-mortem actions through to landing.

Rollback in payments — the awkward truth

You can roll back a deployment of your code. You cannot roll back a settled rail. Once funds have moved, they have moved. Implications:

Feature-flag every new flow. Disable it cleanly at the routing layer, not just in the UI.
Hold-and-release for first-of-its-kind rails: give yourself a safety window during which the customer can't immediately withdraw.
Compensation playbook, not rollback: if you misroute funds, you refund and apologize publicly; you don't rewind.
Customer comms — when a launch goes sideways, customers see a number that won't change. Honest, fast comms win.
Reconciliation hold — pause withdrawals tied to the new rail until you can reconcile.
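
The routing-layer flag, hold-and-release with a reconciliation hold, and a compensation path can be sketched together. Everything here is hypothetical: `RAIL_ENABLED`, `HOLD_WINDOW`, the 24-hour value, the account names, and the `Entry` shape are illustrative, and in practice the flag would live in a feature-flag service consulted by the router rather than a module-level dict. The property that matters is that compensation posts new ledger entries referencing the original payment and never edits the settled one.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

# Hypothetical routing-layer flag table: a rail switched off here is never routed to,
# whatever the client UI still shows. Unknown rails default to off.
RAIL_ENABLED = {"incumbent_card": True, "new_rail": False}

# Hypothetical safety window for first-of-its-kind rails.
HOLD_WINDOW = {"new_rail": timedelta(hours=24)}


def route(candidates: List[str]) -> List[str]:
    """Drop any candidate rail whose flag is off before routing."""
    return [rail for rail in candidates if RAIL_ENABLED.get(rail, False)]


def withdrawable(rail: str, settled_at: datetime, now: datetime, recon_paused: bool = False) -> bool:
    """Hold-and-release plus reconciliation hold for funds that arrived on a held rail."""
    if rail in HOLD_WINDOW and recon_paused:
        return False  # pause withdrawals until the new rail reconciles
    return now >= settled_at + HOLD_WINDOW.get(rail, timedelta(0))


@dataclass(frozen=True)
class Entry:
    account: str
    amount_minor: int  # minor units; negative = debit, positive = credit
    reason: str
    ref: str           # original payment reference, shared by compensating entries


@dataclass
class Ledger:
    """Append-only: settled entries are never edited or deleted."""
    entries: List[Entry] = field(default_factory=list)

    def post(self, entry: Entry) -> None:
        self.entries.append(entry)

    def balance(self, account: str) -> int:
        return sum(e.amount_minor for e in self.entries if e.account == account)


def compensate(ledger: Ledger, original: Entry, funding_account: str = "ops:compensation") -> None:
    """Refund the customer with new forward entries instead of rewinding the original."""
    ledger.post(Entry(original.account, -original.amount_minor, "compensation refund", original.ref))
    ledger.post(Entry(funding_account, original.amount_minor, "compensation funding", original.ref))
```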

The senior phrasing

"In payments, 'rollback' really means 'feature-flag off and compensate.' I'd design the flag and the compensation path before I'd ship the rail."