Error Handling & Failure Modes — Platform PM Edition
What kills platform launches in regulated onboarding — the human, organizational, vendor, and technical failure modes a senior PM should reach for before "what's cool." And the recovery primitives — rollback, kill-switches, partial-launch carve-outs — that turn a bad day into a contained incident.
What actually kills platform launches
Six failure modes account for almost every postmortem in this domain. Interviewers want you to reach for these before "engineering didn't ship in time."
- Compliance veto, late.
- Regulator change mid-build.
- Vendor outage during rollout.
- Downstream-team adoption resistance.
- Scope creep from feature teams demanding bespoke flows.
- Partial regression — old flow broke for a subset of customers.
Each one has a prevention discipline and a recovery primitive. Treat the list as a checklist before any major platform launch.
Failure mode 1 — Compliance veto, late
Scenario: You're three weeks from launch. Compliance reviews the final flow, says the consent capture is insufficient for the German market. Slip.
Why it happens: Compliance was looped in too late, or looped in but didn't have decision authority, or had decision authority but didn't review at enough depth.
Prevention:
- Compliance review is a scheduled deliverable at the strategy stage, the design stage, and the pre-launch stage. Not just pre-launch.
- Define what counts as "policy-relevant change" in advance. Any flow change touching consent, screening, tier requirements, or jurisdictional rules requires compliance sign-off.
- The platform's CI runs a "policy diff": if a flow is being launched in a market, show the diff against the canonical policy for that market (see the sketch after this list).
- Designate a Compliance counterpart per platform initiative; not "Compliance" as an abstract team.
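A minimal sketch of that CI gate, under assumed conventions: canonical and proposed policies live as per-market JSON files (the `canonical_policy/` and `flow_policy/` paths and the `POLICY_SENSITIVE_KEYS` set are hypothetical, not a real tool):

```python
# Hypothetical layout: canonical_policy/<market>.json is the source of
# truth; flow_policy/<market>.json is what the launch would ship.
import json
import sys

# The pre-agreed definition of "policy-relevant": consent, screening,
# tier requirements, jurisdictional rules.
POLICY_SENSITIVE_KEYS = {"consent", "screening", "tier_requirements", "jurisdiction_rules"}

def policy_diff(market: str) -> dict:
    """Return the sensitive keys where the proposed flow diverges
    from the canonical policy for this market."""
    with open(f"canonical_policy/{market}.json") as f:
        canonical = json.load(f)
    with open(f"flow_policy/{market}.json") as f:
        proposed = json.load(f)
    return {
        key: {"canonical": canonical.get(key), "proposed": proposed.get(key)}
        for key in POLICY_SENSITIVE_KEYS
        if canonical.get(key) != proposed.get(key)
    }

if __name__ == "__main__":
    market = sys.argv[1]  # e.g. "DE"
    diff = policy_diff(market)
    if diff:
        print(f"Policy-relevant diff for {market}; compliance sign-off required:")
        print(json.dumps(diff, indent=2))
        sys.exit(1)  # non-zero exit blocks the launch pipeline
```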
Recovery: Have a "carve-out launch" pattern — launch in lower-risk jurisdictions first, with the problematic geo deferred behind a feature flag. Better a 90% launch than a 0% launch.
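A minimal sketch of the carve-out gate, assuming a simple in-code flag set rather than a real feature-flag service (`LAUNCHED_GEOS`, `DEFERRED_GEOS`, and `use_new_platform` are all illustrative):

```python
# LAUNCHED_GEOS / DEFERRED_GEOS are illustrative; in practice this
# would live in a feature-flag service, not in code.
LAUNCHED_GEOS = {"FR", "NL", "ES"}   # lower-risk jurisdictions, wave 1
DEFERRED_GEOS = {"DE"}               # consent capture under rework

def use_new_platform(customer_geo: str) -> bool:
    """Route to the new flow only where it has cleared compliance;
    deferred geos stay on the legacy flow until the carve-out closes."""
    return customer_geo in LAUNCHED_GEOS and customer_geo not in DEFERRED_GEOS
```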
Failure mode 2 — Regulator change mid-build
Scenario: You're four months into a six-month build. FATF issues new guidance on Travel Rule data. Or BaFin changes the documentation requirement for proof-of-address. Your in-flight design is now non-compliant.
Why it happens: Regulatory environment is genuinely dynamic. You can't fully prevent it, but you can absorb it.
Prevention:
- Architecture treats jurisdictional policy as data, not code. The smaller the surface that's hardcoded to a specific rule, the cheaper the absorption (see the sketch after this list).
- Subscribe to regulatory updates (typically via Compliance's tooling); build a "regulatory horizon" review into the quarterly cycle.
- Stage longer in-flight builds — every 90 days, ship something. A 12-month silent build has no chance against regulatory change.
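To make "policy as data" concrete, a sketch under an assumed schema (`JurisdictionPolicy` and its fields are hypothetical): a BaFin-style tightening becomes a new policy version with a later effective date, not a code rewrite.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class JurisdictionPolicy:                   # hypothetical schema
    market: str
    proof_of_address_docs: tuple[str, ...]  # accepted document types
    travel_rule_threshold_eur: int          # data-sharing threshold
    effective_date: date                    # when this version applies

POLICIES = [
    JurisdictionPolicy("DE", ("utility_bill", "bank_statement"), 1000, date(2024, 1, 1)),
    # A regulator change is absorbed by appending a new version,
    # not by changing flow code.
    JurisdictionPolicy("DE", ("bank_statement",), 1000, date(2025, 3, 1)),
]

def active_policy(market: str, on: date) -> JurisdictionPolicy:
    """Latest policy version for the market whose effective date has passed."""
    candidates = [p for p in POLICIES if p.market == market and p.effective_date <= on]
    return max(candidates, key=lambda p: p.effective_date)
```

The design choice that matters: flow code asks `active_policy()` at decision time, so a regulator change becomes a data change with an effective date rather than a rebuild.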
Recovery:
- Triage: does the change apply retroactively (existing customers must re-verify) or only to new customers (forward-only)? A minimal triage sketch follows this list.
- If retroactive: cohort the affected customers, build a step-up flow, schedule a campaign with Compliance and Ops.
- If forward-only: update the policy document and deploy it with an effective date.
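A sketch of the triage step, assuming hypothetical `Customer` records with a `verified_at` date: retroactive changes produce a re-verification cohort; forward-only changes produce none.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Customer:            # hypothetical record shape
    customer_id: str
    market: str
    verified_at: date      # when their last verification completed

def stepup_cohort(customers: list[Customer], market: str,
                  effective: date, retroactive: bool) -> list[Customer]:
    """Customers who must re-verify under the new rule.

    Forward-only changes return an empty cohort: new sign-ups simply
    pick up the new policy from its effective date.
    """
    if not retroactive:
        return []
    return [c for c in customers
            if c.market == market and c.verified_at < effective]
```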
Failure mode 3 — Vendor outage during rollout
Scenario: You roll out the new platform to 100% of new sign-ups in Germany. That afternoon, Persona has a 4-hour incident affecting EU traffic. New sign-ups can't complete verification.
Why it happens: Vendors fail. You knew that intellectually. Now it's Tuesday.
Prevention:
- Vendor abstraction layer with active or warm failover. The waterfall is configured before you need it (see the sketch after this list).
- Vendor health monitoring with automated failover thresholds. Not "page a human and they decide"; the system decides for known failure signatures and notifies the human.
- SLO budget: the platform has an availability SLO that includes vendor dependencies; failing over is part of meeting it.
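One way the waterfall and the automated threshold could fit together. `VendorClient`, `verify()`, and `error_rate()` are assumed interfaces for illustration, not a real vendor SDK:

```python
from typing import Protocol

class VendorClient(Protocol):          # assumed interface, not a real SDK
    name: str
    def verify(self, applicant_id: str) -> dict: ...
    def error_rate(self) -> float: ...  # rolling rate from health checks

FAILOVER_ERROR_RATE = 0.25  # automated threshold; tune against vendor SLOs

def verify_with_waterfall(vendors: list[VendorClient], applicant_id: str) -> dict:
    """Try vendors in configured order; skip any whose health breaches
    the threshold. The system decides; a human is notified, not asked."""
    last_error: Exception | None = None
    for vendor in vendors:
        if vendor.error_rate() > FAILOVER_ERROR_RATE:
            continue  # known bad signature: fail over without waiting on a human
        try:
            return vendor.verify(applicant_id)
        except Exception as exc:  # narrower vendor-specific errors in practice
            last_error = exc
    raise RuntimeError("all vendors unavailable; route to manual-mode fallback") from last_error
```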
Recovery:
- Failover to backup vendor automatically; ops alerted; status page updated.
- If failover is partial (the backup doesn't support every doc type), the unsupported tail queues for retry once the primary recovers, with the customer notified (see the retry sketch below).
- Postmortem includes vendor SLA enforcement — they owe you credits or escalation.
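A sketch of that partial-failover tail, with an illustrative in-memory queue and coverage map standing in for a durable queue and the backup vendor's real capability list:

```python
# Illustrative coverage map and in-memory queue; real systems would use
# a durable queue and the backup vendor's actual capability list.
BACKUP_SUPPORTED_DOCS = {"passport", "national_id"}
retry_queue: list[tuple[str, str]] = []   # (applicant_id, doc_type)

def route_during_failover(applicant_id: str, doc_type: str) -> str:
    """Send what the backup can handle to the backup; queue the rest
    for retry once the primary recovers, and notify the customer."""
    if doc_type in BACKUP_SUPPORTED_DOCS:
        return "backup_vendor"
    retry_queue.append((applicant_id, doc_type))
    return "queued_for_primary_recovery"
```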
Failure mode 4 — Downstream-team adoption resistance
Scenario: The platform is built and works. But Futures, Margin, and Pro all keep their own forked flows. Six months in, adoption is at 20% and you can't justify the headcount.
Why it happens: Adoption is a sale. You skipped the sale.
Prevention:
- Anchor customers — name 2-3 downstream teams as the first wave; co-design with them; their launch is your launch.
- Migration cost subsidy — embed platform eng with the first migrators. Don't ask them to bear the cost alone.
- Make the bespoke path slightly worse — once the platform is good, decline to maintain forks; quietly let them rot.
- Public roadmap with named launches per platform-team partner; visible to leadership.
Recovery if you're already here:
- Diagnostic: is the platform good enough? If not, fix that first.
- If yes: explicit migration plan with deadlines, eng support, and exec air cover.
- Sometimes: pick the team with the most pain, white-glove their migration, make it a case study.
Failure mode 5 — Scope creep from feature teams demanding bespoke flows
Scenario: Institutional wants a "small tweak" — actually a custom doc type, custom decision rules, custom escalation. You said yes once; now you're saying yes to every team and the platform is a kit of forks.
Prevention:
- Public intake with criteria: every request answers "is this a platform primitive (build it) or a team-specific configuration (you build it on top) or a one-off (we decline)?"
- The "two-customer rule" — generalize a primitive only when at least two teams want it. The first team builds in their own space; the second team's ask is when we extract.
- Configuration over customization — most asks resolve to "you can already do this if you configure X" (see the sketch after this list).
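A sketch of configuration-over-customization, with hypothetical team names and config keys: the "small tweak" becomes a declared overlay on platform defaults that the team owns, not a fork of the flow.

```python
# Hypothetical config surface; keys and team names are illustrative.
PLATFORM_DEFAULTS = {
    "doc_types": ["passport", "national_id"],
    "escalation_queue": "standard_ops",
    "decision_ruleset": "retail_v2",
}

TEAM_CONFIG = {
    # Institutional's "small tweak", expressed as configuration
    # rather than a bespoke flow.
    "institutional": {
        "doc_types": ["passport", "certificate_of_incorporation"],
        "escalation_queue": "institutional_ops",
        "decision_ruleset": "institutional_v1",
    },
}

def flow_config(team: str) -> dict:
    """Platform defaults overlaid with the team's declared configuration."""
    return {**PLATFORM_DEFAULTS, **TEAM_CONFIG.get(team, {})}
```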
Recovery: If you've already accumulated forks, run a forks-audit and a paydown plan. Each fork gets a sunset date. Communicate transparently with affected teams.
Failure mode 6 — Partial onboarding regression
Scenario: You deploy a refactor. New sign-ups work fine. But customers who had abandoned mid-flow two weeks ago, returning to resume, hit an empty state. You don't notice for 48 hours.
Prevention:
- Test scenarios cover every state transition, including resume, retry, and step-up — not just happy path.
- Production canary by cohort, not just by % — explicitly include "users in mid-flow state" in canary observation.
- Monitoring covers transition counts (e.g. "resumes per hour"), not just totals; a monitor sketch follows this list.
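A sketch of the transition-count monitor, with `page_oncall` standing in for a real paging integration and the baseline assumed to come from your metrics store:

```python
def page_oncall(message: str) -> None:
    print(f"ALERT: {message}")  # stand-in for a real paging integration

def check_resume_rate(resumes_this_hour: int, baseline_per_hour: float,
                      floor_ratio: float = 0.5) -> bool:
    """Alert when resume transitions fall below half of baseline.

    Total sign-up volume can look healthy while the resume path is
    broken; counting the transition itself catches the regression.
    """
    healthy = resumes_this_hour >= baseline_per_hour * floor_ratio
    if not healthy:
        page_oncall(f"resume transitions at {resumes_this_hour}/h, "
                    f"baseline {baseline_per_hour:.0f}/h")
    return healthy
```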
Recovery:
- Roll back the deploy, or feature-flag off the affected state.
- Identify affected applicants; if they were rejected by the regression, restore their state and reach out.
- Postmortem captures the missing test scenario; it becomes a permanent guardrail.
Recovery primitives — what the platform must support
You can't prevent every failure. You can architect the platform so recovery is fast and contained. Six primitives:
| Primitive | What it does |
|---|---|
| Rollback | Revert a code deploy in < 10 minutes; revert a policy change in seconds |
| Kill-switch per geo / segment | Disable the platform for one country or segment without taking everything down |
| Kill-switch per vendor | Stop routing to a misbehaving vendor immediately |
| Feature flags per flow | Disable individual flow steps without redeploy |
| Manual-mode fallback | Route all decisions to ops queue when automated decisioning is suspect |
| Audit replay | Re-run a decision against a different policy version to assess impact of a hypothetical change |
Build these before you need them. The platform that ships with kill-switches survives its first incident; the one that doesn't, doesn't.
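Most of these are familiar; audit replay is the one teams rarely build. A sketch of its core, assuming decision inputs are stored at decision time and a `decide(inputs, policy)` function exists (both assumptions, not a known API):

```python
from typing import Callable

Decision = str   # e.g. "approve" / "reject" / "manual_review"
Policy = dict    # a policy version, loaded as data

def replay(stored_inputs: list[dict], old: Policy, new: Policy,
           decide: Callable[[dict, Policy], Decision]) -> list[dict]:
    """Decisions that would flip if the candidate policy replaced the
    current one: the impact estimate for a hypothetical change."""
    flips = []
    for inputs in stored_inputs:
        before, after = decide(inputs, old), decide(inputs, new)
        if before != after:
            flips.append({"inputs": inputs, "before": before, "after": after})
    return flips
```

This only works if decision inputs are captured at decision time, which is itself an audit requirement in regulated onboarding; the primitive falls out of discipline you already owe the regulator.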
Postmortem discipline
Every meaningful incident gets a postmortem. The shape:
- Timeline — what happened, when, who noticed, how.
- Impact — which customers, which downstream teams, regulatory exposure.
- Root cause — technical and organizational. Both.
- What worked — the monitors that fired, the responses that helped.
- What didn't — silent failure modes, slow detection, missing playbooks.
- Action items — owner, due date, classification (preventive / detection / response).
- Compliance implications — was there a regulatory-notification trigger? Was an audit log gap exposed?
Postmortems are not blame artifacts. They're capability multiplication — every incident makes the team a little harder to break the same way again. In a regulated context, this is also evidence of operational maturity if a regulator asks.
Avoid Friday deploys for regulated platforms. Not a hard rule, but the default. The cost of a Friday-deploy incident is 48 hours of customer harm with reduced staff. Save aggressive launches for Tuesday-Wednesday.