Launch & Ops — PM Point of View
How a Platform PM ships a change without breaking things — staged rollout, vendor health, on-call partnership, hypercare, rollback authority. In a regulated platform, every launch is also a compliance event; the playbook reflects that.
The launch playbook for platform changes
A platform launch isn't a marketing event. It's a graduated risk-acceptance ceremony with named decision-makers and named rollback paths. The shape of a serious launch:
- Pre-launch readiness review. A week before launch, with Eng, Compliance, Ops, the on-call rep, and customer support. Work the checklist below.
- Internal canary. Deploy to staff or a 1% subset. 24-48h soak.
- Geo canary. One lower-risk market, 100% of new traffic. 1-2 weeks.
- Staged ramp. Next markets, in waves. Compliance and Growth pre-aware of each.
- Hypercare. 2-4 weeks of elevated monitoring after full ramp.
- GA declaration. Explicit; on-call returns to baseline; old path deprecated on a schedule.
- Postmortem (even if it went well). What did we learn? What's the standing change?
Pre-launch readiness checklist
- Rollback plan in writing, tested in staging.
- A feature flag covers the change; its off-state is unambiguously the prior behavior.
- Monitoring dashboards for the change live, with thresholds defined.
- On-call rep informed; runbook updated; pager paths confirmed.
- Compliance sign-off on the change classification + scope.
- Customer-support FAQ updated; common cases addressed.
- Comms to internal stakeholders queued — status changes posted as they happen.
- Vendor partners aware (if applicable).
- Audit-log emission verified end-to-end for the new flow.
- Test data scenarios covered in sandbox.
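The checklist above works best when it gates the pipeline mechanically rather than living in a doc. A minimal sketch (item names and the helper are illustrative, not a prescribed schema):

```python
# Illustrative readiness items mirroring the checklist above; names are assumptions.
READINESS_CHECKLIST = [
    "rollback_plan_tested_in_staging",
    "feature_flag_off_state_is_prior_behavior",
    "dashboards_live_with_thresholds",
    "oncall_runbook_updated",
    "compliance_signoff",
    "support_faq_updated",
    "stakeholder_comms_queued",
    "vendor_partners_notified",
    "audit_log_verified_end_to_end",
    "sandbox_test_scenarios_covered",
]

def readiness_gaps(checked: set[str]) -> list[str]:
    """Items still unchecked; the launch proceeds only when this list is empty."""
    return [item for item in READINESS_CHECKLIST if item not in checked]
```

The point of encoding it is that "mostly ready" becomes visible as a named list of gaps instead of a judgment call in a meeting.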
Staged geo rollout
Stage by jurisdiction, not by % of global traffic. Why: jurisdictions differ enough that a bug specific to German residence-permit handling won't show up in an Australia-heavy canary. Order by:
- Risk: lower-risk geos first.
- Volume: low-volume geos first so an issue affects fewer customers.
- Regulatory tolerance: some regulators are tolerant of canary-style rollouts; some expect notice. Know which is which.
- Operational readiness: Ops has local-language staffing for the canary geo.
Each geo has gating criteria: 7 days at full volume + sanctions-hit-rate normal + complaint-rate within band + decision-latency within SLA. Only then ramp next.
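The gating criteria can be encoded as a single pass/fail check per geo. A sketch, assuming illustrative field names and a ±25% band for "sanctions-hit-rate normal" (your real bands come from your own baselines):

```python
from dataclasses import dataclass

@dataclass
class GeoMetrics:
    days_at_full_volume: int
    sanctions_hit_rate: float      # fraction of screenings producing a hit
    sanctions_baseline: float      # historical baseline for this geo
    complaint_rate: float          # complaints per 1k decisions
    complaint_band_max: float
    decision_latency_p95_s: float
    latency_sla_s: float

def geo_gate_passed(m: GeoMetrics) -> bool:
    """True only when every gating criterion holds; only then ramp the next geo."""
    return (
        m.days_at_full_volume >= 7
        and abs(m.sanctions_hit_rate - m.sanctions_baseline) <= 0.25 * m.sanctions_baseline
        and m.complaint_rate <= m.complaint_band_max
        and m.decision_latency_p95_s <= m.latency_sla_s
    )
```

Conjunction matters: a geo that is fast and quiet but only three days into full volume still fails the gate.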
Vendor health monitoring
Vendors fail. The platform's job is to detect failures and route around them faster than the customer notices.
What to monitor:
- Success rate — % of vendor calls returning a decision. Watch both the point-in-time rate and a rolling-window average.
- Latency p95 / p99 — slow vendors hurt activation even when "up."
- Error-type distribution — a spike in a specific error code is signal before the overall rate moves.
- Decision-quality drift — sudden change in reject rate or sudden change in score distribution is usually a vendor model update.
- Public status pages — subscribe to all vendor status pages; integrate into your dashboard.
When monitoring triggers: automated failover (route to backup vendor); human alert (page on-call); status post to internal stakeholders; vendor-side ticket / call.
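Two of those signals — rolling-window success rate and per-error-code spikes — fit in a small monitor. A sketch with assumed window sizes and thresholds:

```python
from collections import Counter, deque

class VendorHealth:
    """Rolling-window vendor monitor: success rate plus per-error-code spikes.

    Window size and thresholds are illustrative, not recommended values.
    """

    def __init__(self, window: int = 500, min_success: float = 0.98):
        self.min_success = min_success
        self.calls: deque[str] = deque(maxlen=window)  # "ok" or an error code

    def record(self, outcome: str) -> None:
        self.calls.append(outcome)

    def success_rate(self) -> float:
        if not self.calls:
            return 1.0
        return sum(1 for c in self.calls if c == "ok") / len(self.calls)

    def should_failover(self) -> bool:
        # Automated failover trigger: rolling success rate below the floor,
        # with a minimum sample so one bad call can't flip the route.
        return len(self.calls) >= 50 and self.success_rate() < self.min_success

    def error_spike(self, threshold: float = 0.05) -> list[str]:
        # A single error code above `threshold` of the window is a leading
        # signal, often before the overall success rate moves.
        n = len(self.calls) or 1
        counts = Counter(c for c in self.calls if c != "ok")
        return [code for code, k in counts.items() if k / n > threshold]
```

The minimum-sample guard in `should_failover` is the design choice worth copying: failover should be automatic but never twitchy.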
On-call partnership — the PM's role
You're not on the pager. But the on-call eng's job is harder when the PM is absent. Specific moves:
- Co-author runbooks. The runbook for "vendor X is down" should be written by PM + eng together, with PM owning the customer-facing decisions and eng owning the technical response.
- Decision authority pre-delegated. "If failover doesn't recover in 30 minutes, on-call has authority to send the broadcast email" — pre-agreed, not litigated at 3am.
- Be available during launches and major changes. Be reachable, not asleep. Hypercare = PM hypercare too.
- Post-incident: write the user-facing summary. Don't make the eng do customer-comms.
- Run the postmortem. Schedule, facilitate, write up. This is the PM's job.
The hypercare period
After full ramp, the platform is at peak risk. Hypercare = elevated attention for a defined window. Typically 2-4 weeks for a major change.
- Daily check-ins with eng on dashboard health.
- On-call has explicit "during hypercare, page on lower threshold" instructions.
- Customer support has a hotline for hypercare-period escalations.
- Compliance gets a weekly summary of state during hypercare.
- Rollback authority remains delegated; rolling back during hypercare is normal, not failure.
Hypercare ends with an explicit declaration: dashboards within band, no open hot tickets, postmortem complete.
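Both the lowered paging thresholds and the explicit exit declaration are simple enough to encode. A sketch; the 0.5 factor and the exit conditions mirror the list above but the numbers are assumptions:

```python
def alert_threshold(baseline: float, hypercare: bool, factor: float = 0.5) -> float:
    """During hypercare, page at a fraction of the steady-state threshold.

    `factor=0.5` is illustrative: on-call is paged at half the deviation
    that would page during normal operation.
    """
    return baseline * factor if hypercare else baseline

def hypercare_complete(dashboards_in_band: bool, open_hot_tickets: int,
                       postmortem_done: bool) -> bool:
    """Explicit exit declaration: all three conditions, not a calendar date."""
    return dashboards_in_band and open_hot_tickets == 0 and postmortem_done
```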
Declaring GA — what changes
General Availability is the moment the old path can be deprecated. Until then, both paths run.
- SLA shifts to standard (vs hypercare).
- Old path enters formal deprecation: announced, instrumented for usage, sunset date set.
- Migration of last consumer teams scheduled.
- Final postmortem and learnings written up — what would we do differently for the next launch?
- Documentation updated to drop "beta" or "preview" language.
GA is a deliberate decision, not a calendar event. Don't drift into it.
Rollback authority — who pulls the trigger
Rolling back is a culture question disguised as a technical question. The platform that's allowed to roll back without political friction recovers faster.
A workable model:
- Eng on-call has authority to revert a deploy any time during hypercare without PM sign-off.
- PM + eng on-call jointly can disable a feature flag during business hours.
- PM + Compliance jointly can roll back a policy change.
- VP level required only for cross-team rollbacks that disrupt other product launches.
Document the authority matrix. The worst time to debate who can roll back is during a P1.
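The authority matrix documents cleanly as data, so a runbook or tool can answer "can these people roll this back?" without debate. A sketch mirroring the model above (role and action names are illustrative):

```python
# Each action maps to the set of approver combinations that authorise it.
ROLLBACK_AUTHORITY: dict[str, set[frozenset[str]]] = {
    "revert_deploy":        {frozenset({"eng_oncall"})},         # any time during hypercare
    "disable_feature_flag": {frozenset({"pm", "eng_oncall"})},   # business hours
    "rollback_policy":      {frozenset({"pm", "compliance"})},
    "cross_team_rollback":  {frozenset({"vp"})},
}

def may_roll_back(action: str, approvers: set[str]) -> bool:
    """True if the approvers present satisfy any authorised combination."""
    return any(combo <= approvers for combo in ROLLBACK_AUTHORITY.get(action, set()))
```

Note the lookup is permissive upward (extra approvers never hurt) and defaults to no for unknown actions, which is the safe failure mode during a P1.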
If rolling back is technically expensive — a 6-hour data migration, a multi-system coordination — you've already chosen a deployment shape that punishes safety. Invest in cheap rollback; it pays for itself the first incident.
The Friday-deploy rule in regulated products
Default: no Friday deploys for the platform. Reasons specific to regulated onboarding:
- Reduced staffing over the weekend means slower detection and slower response.
- Customer-support escalations over the weekend lack PM/eng presence.
- Compliance counterparts are not on call; if a regulator-notification trigger fires, you can't reach them.
- Weekend traffic patterns are different — bugs that hide in weekday mix can surface.
Exception process: emergency fixes, hot patches, and clearly-scoped low-risk changes can ship Friday with explicit PM + eng sign-off. The default-no isn't a hard ban; it's a friction.
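The default-no-with-exceptions shape is easy to put in the deploy pipeline itself. A sketch; the flag names are assumptions:

```python
import datetime

def deploy_allowed(when: datetime.date, *, emergency: bool = False,
                   pm_signoff: bool = False, eng_signoff: bool = False) -> bool:
    """Default-no on Fridays; exceptions need emergency status or PM + eng sign-off."""
    if when.weekday() != 4:  # Monday=0 ... Friday=4
        return True
    return emergency or (pm_signoff and eng_signoff)
```

Making the sign-off flags explicit arguments is the friction: a Friday deploy requires someone to type out who approved it.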
Ops tooling is a PM deliverable
Ops manages the manual-review queue, the customer-escalation pipeline, the disposition workflows. Their tooling is part of the platform.
- Review console — surface evidence, allow comparison, one-click dispositions.
- Audit search — query an applicant's full history quickly.
- Bulk actions — re-screen a cohort, re-process a queue.
- Quality controls — sampling, second-reviewer flow, escalation paths.
- Reporting — queue depth, time-in-queue, dispositioning rate, by reviewer.
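The reporting bullet reduces to a small aggregation over queue items. A sketch, assuming an illustrative item shape (`age_h`, `reviewer`, `disposed`):

```python
from statistics import mean

def queue_report(items: list[dict]) -> dict:
    """Queue-health summary: depth, mean time-in-queue, dispositions per reviewer.

    Each item: {"age_h": float, "reviewer": str | None, "disposed": bool}.
    Field names are assumptions for illustration.
    """
    open_items = [i for i in items if not i["disposed"]]
    by_reviewer: dict[str, int] = {}
    for i in items:
        if i["disposed"] and i["reviewer"]:
            by_reviewer[i["reviewer"]] = by_reviewer.get(i["reviewer"], 0) + 1
    return {
        "queue_depth": len(open_items),
        "mean_time_in_queue_h": mean(i["age_h"] for i in open_items) if open_items else 0.0,
        "disposed_by_reviewer": by_reviewer,
    }
```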
If Ops can't do their job in your tooling, they'll do it in spreadsheets. Spreadsheets aren't auditable. Treat Ops as a first-class customer of the platform.
Status page & internal comms
The platform owes its downstream teams visibility. Even when no incident is in progress:
- Internal status page — current platform health, current vendor health, ongoing changes.
- Change calendar — what's deploying this week, by class.
- Postmortem archive — readable summaries, action items, learnings.
- Quarterly platform review — what shipped, what didn't, what's next.
The platform that communicates well has loyal customers. The one that goes silent and surprises people loses adoption.