Section D · Launch & Ops

Launch & Ops — PM Point of View

How a Platform PM ships a change without breaking things — staged rollout, vendor health, on-call partnership, hypercare, rollback authority. In a regulated platform, every launch is also a compliance event; the playbook reflects that.

The launch playbook for platform changes

A platform launch isn't a marketing event. It's a graduated risk-acceptance ceremony with named decision-makers and named rollback paths. The shape of a serious launch:

  1. Pre-launch readiness review. A week before, with Eng, Compliance, Ops, the on-call rep, and customer support. Work through the checklist below.
  2. Internal canary. Deploy to staff or a 1% subset. 24-48h soak.
  3. Geo canary. One lower-risk market, 100% of new traffic. 1-2 weeks.
  4. Staged ramp. Next markets, in waves. Compliance and Growth are briefed ahead of each wave.
  5. Hypercare. 2-4 weeks of elevated monitoring after full ramp.
  6. GA declaration. Explicit; on-call returns to baseline; old path deprecated on a schedule.
  7. Postmortem (even if it went well). What did we learn? What's the standing change?
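
The stage plan is easier to keep honest when it lives as data rather than only in a launch doc. A minimal sketch, assuming hypothetical stage names, soak windows, and gate labels:

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    """One step of the staged rollout, with its audience and exit gates."""
    name: str
    audience: str                  # who gets the new path at this stage
    min_soak_days: int             # minimum time before the next stage starts
    exit_gates: list[str] = field(default_factory=list)

# Hypothetical plan mirroring steps 2-5 above; names and windows are examples only.
LAUNCH_PLAN = [
    Stage("internal_canary", "staff or a 1% subset", min_soak_days=2,
          exit_gates=["error_rate_within_band", "no_open_hot_tickets"]),
    Stage("geo_canary", "one lower-risk market, 100% of new traffic", min_soak_days=7,
          exit_gates=["sanctions_hit_rate_normal", "complaint_rate_within_band"]),
    Stage("staged_ramp", "remaining markets, in waves", min_soak_days=7,
          exit_gates=["decision_latency_within_sla"]),
    Stage("hypercare", "all markets, elevated monitoring", min_soak_days=14,
          exit_gates=["dashboards_within_band", "postmortem_scheduled"]),
]
```

Naming the gates up front also sharpens the readiness review: each gate should map to a dashboard that exists before launch.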

Pre-launch readiness checklist

  • Rollback plan in writing, tested in staging.
  • Feature-flag covers the change; off-state is unambiguously the prior behavior.
  • Monitoring dashboards for the change live, with thresholds defined.
  • On-call rep informed; runbook updated; pager paths confirmed.
  • Compliance sign-off on the change classification + scope.
  • Customer-support FAQ updated; common cases addressed.
  • Comms to internal stakeholders queued — status changes posted as they happen.
  • Vendor partners aware (if applicable).
  • Audit-log emission verified end-to-end for the new flow.
  • Test data scenarios covered in sandbox.
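
The feature-flag item deserves to be taken literally: the off-state has to be exactly the prior behavior, so turning the flag off (or losing the flag service) is a clean rollback. A minimal sketch, with hypothetical function names standing in for the old and new paths:

```python
def verify_identity_v1(applicant: dict) -> str:
    """Prior behavior: placeholder for the existing verification path."""
    return "v1-decision"

def verify_identity_v2(applicant: dict) -> str:
    """New behavior being launched: placeholder for the new path."""
    return "v2-decision"

def verify_identity(applicant: dict, flag_enabled: bool = False) -> str:
    """Route between paths behind a single flag; the off-state is exactly v1."""
    if flag_enabled:
        return verify_identity_v2(applicant)
    return verify_identity_v1(applicant)
```

Defaulting the flag to off means a missing or unreadable flag degrades to the pre-launch path rather than the untested one.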

Staged geo rollout

Stage by jurisdiction, not by % of global traffic. Why: jurisdictions differ enough that a bug specific to German residence-permit handling won't show up in an Australia-heavy canary. Order by:

  • Risk: lower-risk geos first.
  • Volume: low-volume geos first so an issue affects fewer customers.
  • Regulatory tolerance: some regulators are tolerant of canary-style rollouts; some expect notice. Know which is which.
  • Operational readiness: Ops has local-language staffing for the canary geo.

Each geo has gating criteria: 7 days at full volume + sanctions-hit-rate normal + complaint-rate within band + decision-latency within SLA. Only when every gate passes does the next geo ramp.
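
Gates drift when they live only in someone's head. A sketch of an explicit gate check, with made-up metric names and example thresholds:

```python
from dataclasses import dataclass

@dataclass
class GeoMetrics:
    """Rolled-up metrics for one geo at full canary volume (hypothetical fields)."""
    days_at_full_volume: int
    sanctions_hit_rate: float          # observed rate
    sanctions_hit_rate_baseline: float
    complaint_rate: float              # complaints per 1,000 applications
    decision_latency_p95_s: float      # seconds

def geo_ready_to_ramp(m: GeoMetrics) -> bool:
    """All gates must pass before the next geo ramps; thresholds are examples."""
    return (
        m.days_at_full_volume >= 7
        and abs(m.sanctions_hit_rate - m.sanctions_hit_rate_baseline)
            <= 0.1 * m.sanctions_hit_rate_baseline
        and m.complaint_rate <= 2.0
        and m.decision_latency_p95_s <= 60.0
    )
```

Whatever the real thresholds are, writing them as one function turns the ramp decision into a yes/no answer instead of a debate.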

Vendor health monitoring

Vendors fail. The platform's job is to detect the failure and route around it faster than the customer notices.

What to monitor:

  • Success rate — % of vendor calls returning a decision. Watch it on a rolling window, not just as a point-in-time number.
  • Latency p95 / p99 — slow vendors hurt activation even when "up."
  • Error-type distribution — a spike in a specific error code is signal before the overall rate moves.
  • Decision-quality drift — a sudden shift in reject rate or in the score distribution usually means a vendor-side model update.
  • Public status pages — subscribe to all vendor status pages; integrate them into your dashboard.

When a monitor triggers: automated failover (route to the backup vendor), human alert (page on-call), a status post to internal stakeholders, and a vendor-side ticket or call.
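
A sketch of how those responses might hang together, with hypothetical thresholds and a hypothetical `actions` object standing in for the failover, paging, and comms integrations; the point is that all four fire from the same signal:

```python
from dataclasses import dataclass

@dataclass
class VendorHealth:
    """Rolling-window health snapshot for one vendor (fields are illustrative)."""
    success_rate: float      # share of calls returning a decision
    latency_p99_s: float     # seconds
    error_spike: bool        # one error code dominating recent failures

def on_health_check(vendor: str, h: VendorHealth, actions) -> None:
    """Fan out the agreed responses when a vendor degrades.

    `actions` is a hypothetical object exposing the team's failover, paging,
    status, and ticketing hooks; the thresholds below are examples.
    """
    degraded = h.success_rate < 0.98 or h.latency_p99_s > 10.0 or h.error_spike
    if not degraded:
        return
    actions.failover(from_vendor=vendor)                          # route to backup vendor
    actions.page_oncall(f"{vendor} degraded: {h}")                # human in the loop
    actions.post_status(f"{vendor} degraded, failover engaged")   # internal comms
    actions.open_vendor_ticket(vendor, details=h)                 # start the vendor-side clock
```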

On-call partnership — the PM's role

You're not on the pager. But the on-call eng's job is harder when the PM is absent. Specific moves:

  • Co-author runbooks. The runbook for "vendor X is down" should be written by PM + eng together, with PM owning the customer-facing decisions and eng owning the technical response.
  • Decision authority pre-delegated. "If failover doesn't recover in 30 minutes, on-call has authority to send the broadcast email" — pre-agreed, not litigated at 3am.
  • Be available during launches and major changes. Be reachable, not asleep. Hypercare = PM hypercare too.
  • Post-incident: write the user-facing summary. Don't make the eng do customer-comms.
  • Run the postmortem. Schedule, facilitate, write up. This is the PM's job.

The hypercare period

After full ramp, the platform is at peak risk. Hypercare = elevated attention for a defined window. Typically 2-4 weeks for a major change.

  • Daily check-ins with eng on dashboard health.
  • On-call has explicit "during hypercare, page on lower threshold" instructions.
  • Customer support has a hotline for hypercare-period escalations.
  • Compliance gets a weekly summary of platform state during hypercare.
  • Rollback authority remains delegated; rolling back during hypercare is normal, not failure.

Hypercare ends with an explicit declaration: dashboards within band, no open hot tickets, postmortem complete.
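
One way to make the lower hypercare paging threshold real rather than tribal knowledge is to keep both thresholds side by side in the alert config. A sketch with invented alert names and example numbers:

```python
# Hypothetical alert thresholds; the hypercare values are deliberately tighter.
ALERT_THRESHOLDS = {
    "decision_error_rate":          {"baseline": 0.02, "hypercare": 0.005},
    "manual_review_queue_depth":    {"baseline": 500,  "hypercare": 200},
    "vendor_failovers_per_hour":    {"baseline": 3,    "hypercare": 1},
}

def paging_threshold(alert: str, in_hypercare: bool) -> float:
    """Pick the tighter threshold while hypercare is declared."""
    t = ALERT_THRESHOLDS[alert]
    return t["hypercare"] if in_hypercare else t["baseline"]
```

Ending hypercare then has a mechanical side effect: flip the flag, and paging returns to baseline everywhere at once.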

Declaring GA — what changes

General Availability is the moment the old path can be deprecated. Until then, both paths run.

  • SLA shifts to standard (vs hypercare).
  • Old path enters formal deprecation: announced, instrumented for usage, sunset date set.
  • Migration of last consumer teams scheduled.
  • Final postmortem and learnings written up — what would we do differently for the next launch?
  • Documentation updated to drop "beta" or "preview" language.

GA is a deliberate decision, not a calendar event. Don't drift into it.
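
Deprecation only sticks if remaining usage of the old path is measured rather than guessed. A small sketch, with a hypothetical sunset date and a plain log counter standing in for whatever metrics pipeline the team actually uses:

```python
import logging
from datetime import date

OLD_PATH_SUNSET = date(2025, 9, 30)   # example sunset date, set at GA

log = logging.getLogger("deprecation")

def record_old_path_call(consumer_team: str) -> None:
    """Count every call to the deprecated path, tagged by consuming team.

    The weekly usage report comes from these counts; any team still at
    non-zero usage close to the sunset date gets a scheduled migration slot.
    """
    log.warning("deprecated path used team=%s sunset=%s",
                consumer_team, OLD_PATH_SUNSET.isoformat())
```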

Rollback authority — who pulls the trigger

Rolling back is a culture question disguised as a technical question. The platform that's allowed to roll back without political friction recovers faster.

A workable model:

  • Eng on-call has authority to revert a deploy any time during hypercare without PM sign-off.
  • PM + eng on-call jointly can disable a feature flag during business hours.
  • PM + Compliance jointly can roll back a policy change.
  • VP level required only for cross-team rollbacks that disrupt other product launches.

Document the authority matrix. The worst time to debate who can roll back is during a P1.
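
The matrix is small enough to live next to the runbooks as data. A sketch of the model above, purely illustrative:

```python
# Which roles must sign off on each class of rollback (illustrative only).
ROLLBACK_AUTHORITY = {
    "revert_deploy_during_hypercare": {"eng_oncall"},
    "disable_feature_flag":           {"pm", "eng_oncall"},   # jointly
    "rollback_policy_change":         {"pm", "compliance"},   # jointly
    "cross_team_rollback":            {"vp"},
}

def can_roll_back(action: str, approvers: set[str]) -> bool:
    """True when every required role for the action has signed off."""
    return ROLLBACK_AUTHORITY[action].issubset(approvers)
```

An on-call tool can check it mechanically; more importantly, it makes the pre-agreed answer findable at 3am.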

Make rollbacks cheap

If rolling back is technically expensive — a 6-hour data migration, a multi-system coordination — you've already chosen a deployment shape that punishes safety. Invest in cheap rollback; it pays for itself at the first incident.
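
Cheap rollback is mostly a data-shape decision made before launch. One common pattern, offered here as a sketch rather than a prescription, is to keep changes additive until GA so that reverting is a flag flip instead of a migration:

```python
def save_applicant_name(record: dict, full_name: str, write_new_schema: bool) -> None:
    """Expand phase: keep the old field populated, add the new fields alongside.

    Nothing is dropped or rewritten in place, so rolling back is just turning
    `write_new_schema` off; the contract (cleanup) step waits until after GA.
    """
    record["name"] = full_name                 # old shape, still authoritative
    if write_new_schema:
        first, _, last = full_name.partition(" ")
        record["given_name"], record["family_name"] = first, last
```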

The Friday-deploy rule in regulated products

Default: no Friday deploys for the platform. Reasons specific to regulated onboarding:

  • Reduced staffing over the weekend means slower detection and slower response.
  • Customer-support escalations over the weekend lack PM/eng presence.
  • Compliance counterparts are not on call; if a regulator-notification trigger fires, you can't reach them.
  • Weekend traffic patterns are different — bugs that hide in weekday mix can surface.

Exception process: emergency fixes, hot patches, and clearly scoped low-risk changes can ship Friday with explicit PM + eng sign-off. The default-no isn't a hard ban; it's deliberate friction.
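
The default-no is easy to encode in the deploy pipeline. A sketch with hypothetical sign-off fields:

```python
from datetime import datetime

def deploy_allowed(now: datetime, emergency_override: bool,
                   pm_signed_off: bool, eng_signed_off: bool) -> bool:
    """Block Friday-and-weekend deploys unless the exception process is used.

    Mirrors the written rule: emergency fixes and clearly scoped low-risk
    changes can still ship with explicit PM + eng sign-off.
    """
    risky_day = now.weekday() >= 4            # Friday=4, Saturday=5, Sunday=6
    if not risky_day:
        return True
    return emergency_override and pm_signed_off and eng_signed_off

# Example: a Friday-afternoon hotfix with both sign-offs goes through.
assert deploy_allowed(datetime(2024, 6, 7, 15, 0), True, True, True)
```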

Ops tooling is a PM deliverable

Ops manages the manual-review queue, the customer-escalation pipeline, the disposition workflows. Their tooling is part of the platform.

  • Review console — surface evidence, allow comparison, one-click dispositions.
  • Audit search — query an applicant's full history quickly.
  • Bulk actions — re-screen a cohort, re-process a queue.
  • Quality controls — sampling, second-reviewer flow, escalation paths.
  • Reporting — queue depth, time-in-queue, disposition rate, broken down by reviewer.

If Ops can't do their job in your tooling, they'll do it in spreadsheets. Spreadsheets aren't auditable. Treat Ops as a first-class customer of the platform.
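
The reporting bullet is small enough to prototype before the console exists. A sketch over a hypothetical queue-item shape:

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class QueueItem:
    """One manual-review case (hypothetical fields)."""
    reviewer: Optional[str]          # None while unassigned
    entered_at: datetime
    disposed_at: Optional[datetime]  # None while still open

def queue_report(items: list[QueueItem], now: datetime) -> dict:
    """Queue depth, average time-in-queue, and dispositions per reviewer."""
    open_items = [i for i in items if i.disposed_at is None]
    by_reviewer: dict[str, int] = defaultdict(int)
    for i in items:
        if i.disposed_at is not None and i.reviewer:
            by_reviewer[i.reviewer] += 1
    avg_wait_h = (
        sum((now - i.entered_at).total_seconds() for i in open_items)
        / 3600 / len(open_items)
    ) if open_items else 0.0
    return {
        "queue_depth": len(open_items),
        "avg_time_in_queue_hours": round(avg_wait_h, 1),
        "dispositions_by_reviewer": dict(by_reviewer),
    }
```

If those numbers come out of the same system Ops works in, the spreadsheet shadow workflow never gets started.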

Status page & internal comms

The platform owes its downstream teams visibility. Even when no incident is in progress:

  • Internal status page — current platform health, current vendor health, ongoing changes.
  • Change calendar — what's deploying this week, by class.
  • Postmortem archive — readable summaries, action items, learnings.
  • Quarterly platform review — what shipped, what didn't, what's next.

The platform that communicates well has loyal customers. The one that goes silent and surprises people loses adoption.