Go-Live Runbook
The literal checklist for flipping the switch. Day -7 through Day +30. On-call rotation, dashboards, rollback triggers and procedure, and the post-launch monitoring rhythm.
On-call & comms setup
Set this up the week before, not the morning of. Have a named person at every position; an "owner: TBD" line on launch day is how launches fail.
The launch roster
| Role | Who | Primary responsibility |
|---|---|---|
| Launch lead | PM owning the project | Final call on go / no-go / rollback. Single decision-maker. |
| Tech on-call (primary) | Eng lead on integration | Triage technical issues; can flip rollout pct |
| Tech on-call (secondary) | Backup engineer | Take over if primary unavailable; ≤30 min response |
| Compliance on-call | MLRO / CCO delegate | Audit-trail review; regulator-facing if needed |
| Customer ops lead | Support manager | Surface user signals not in metrics; staff queue |
| Vendor TAM | Vendor's account-side person | Backchannel to vendor engineering during incident |
| Comms / PR | Comms lead | External messaging if launch becomes user-visible |
Channels
- Working channel (e.g., #idv-launch): everyone above + observers. All decisions logged here.
- Vendor shared channel (Slack Connect / Teams): direct line to vendor engineering.
- Incident channel: existing #incidents or equivalent for P1.
- Decision log: pinned doc, one row per decision (timestamp, decision, decider, rationale).
Pin the exact command / config-flag flip to roll back to 0% at the top of the launch channel. If you're improvising it at 3 AM, you've lost.
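What gets pinned depends entirely on your flag service, but it's worth dry-run testing the exact flip in advance. A minimal sketch, assuming a hypothetical internal feature-flag HTTP API — `FLAG_SERVICE_URL` and the flag name `idv_vendor_rollout_pct` are placeholders, substitute your real endpoint and flag:

```python
import json
import urllib.request

# Hypothetical internal flag service; substitute your real endpoint and flag name.
FLAG_SERVICE_URL = "https://flags.internal.example.com/api/flags"
FLAG_NAME = "idv_vendor_rollout_pct"

def build_rollback_request(actor: str, reason: str) -> urllib.request.Request:
    """Construct the request that sets rollout to 0%. Kept as a builder so
    the exact request can be reviewed and dry-run tested before launch day."""
    payload = {"value": 0, "actor": actor, "reason": reason}
    return urllib.request.Request(
        url=f"{FLAG_SERVICE_URL}/{FLAG_NAME}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Dry-run check: the same request you'd send at 3 AM, verified in advance.
req = build_rollback_request(actor="eng-oncall", reason="launch rollback")
assert json.loads(req.data)["value"] == 0
assert req.get_method() == "PUT"
```

The point is not this particular shape — it's that the rollback is a single pre-built, pre-tested artifact, not something composed from memory mid-incident.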
Day -7 — Pre-launch checklist
One week out. Everything below should be done before Day 0 morning. If anything's still red by Day -3, slip the launch date.
Technical readiness
- Vendor integration in production environment, dark-launched for ≥ 7 days
- Vendor-agnostic abstraction layer in place (no vendor SDK leak into domain code)
- Rollout config flag wired to feature-flag service, tested both directions
- Webhook endpoint receiving production webhooks, signed-payload verification working, idempotent
- Webhook dead-letter queue + alerting on DLQ depth
- Verification request idempotency keys deployed; duplicate-submission test passes
- Audit-log schema deployed; sample 100 records reviewed by compliance
- All dashboards from §dashboards live and verified against test traffic
- Alerts wired to PagerDuty / equivalent with thresholds calibrated against shadow-mode data
- Rollback flip tested in staging end-to-end (≤ 5 min, no in-flight verification corruption)
- Load tested at 3× expected peak; vendor SLA confirmed under load
- Error paths tested: vendor 5xx, timeout, malformed response, invalid signature
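The signed-payload verification and idempotent handling above can be sketched with stdlib primitives. This assumes the vendor signs the raw request body with HMAC-SHA256 and sends the hex digest in a header — check your vendor's actual scheme (header name, encoding, timestamp binding) before relying on it:

```python
import hmac
import hashlib

WEBHOOK_SECRET = b"shared-secret-from-vendor-console"  # placeholder

def verify_signature(raw_body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison of the vendor's HMAC-SHA256 signature."""
    expected = hmac.new(WEBHOOK_SECRET, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

_seen_event_ids: set[str] = set()  # use a persistent store in production

def handle_webhook(event_id: str, raw_body: bytes, signature_hex: str) -> str:
    """Reject bad signatures; process each event id at most once."""
    if not verify_signature(raw_body, signature_hex):
        return "rejected"      # alert if invalid-signature rate is nonzero
    if event_id in _seen_event_ids:
        return "duplicate"     # safe no-op; vendor retries make these normal
    _seen_event_ids.add(event_id)
    return "processed"

body = b'{"verification_id": "v_123", "decision": "approve"}'
sig = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
assert handle_webhook("evt_1", body, sig) == "processed"
assert handle_webhook("evt_1", body, sig) == "duplicate"   # idempotent replay
assert handle_webhook("evt_2", body, "bad-signature") == "rejected"
```

Note that `compare_digest` (not `==`) is what makes the check resistant to timing attacks, and that the duplicate path must be a silent success, not an error — vendors retry aggressively.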
Compliance & ops readiness
- MLRO / CCO sign-off on the verification flow and audit trail
- DPA + MSA + Order Form fully executed, copies stored in contract repository
- Public sub-processor list updated to include new vendor (where required)
- User-facing privacy notice updated; legal sign-off on copy
- Retention policies configured on vendor side per jurisdiction
- Manual review team trained on vendor console; access provisioned via SSO
- Customer support trained on the new flow + dispute process + escalation
- Internal KB articles updated: how to dispute a decline, common error messages, escalation routing
- User-facing dispute / appeal mechanism live and end-to-end tested
- Regulator notifications sent (where the rules require informing them of vendor change)
Day -1 — Final pre-flight
- Go / no-go meeting at end of day. Required attendees: Launch lead, Eng on-call, Compliance, Customer ops, Vendor TAM
- Baseline metrics captured: completion rate, false-reject, latency, manual-review queue depth (last 7d)
- Rollout pct currently 0; confirmed in feature-flag UI by two people
- Vendor TAM confirmed on standby; their engineering team has launch-window awareness
- Vendor's status page checked; no in-flight incidents
- On-call roster confirmed; PagerDuty schedules verified; backup paths tested
- Rollback command pinned in launch channel
- Code freeze on the verification stack from end of Day -1 through end of Day +1
- All dashboards opened in browser tabs, one set per on-call person
Day 0 — Launch day
Flip the switch. Watch. Don't multitask. The whole on-call rotation is on this and only this.
Timeline (typical morning launch)
| Time | Action | Owner |
|---|---|---|
| T-30 min | Roster gathers in launch channel; final go/no-go | Launch lead |
| T-15 min | Snapshot all dashboards; baseline confirmed | Eng on-call |
| T 0 | Set rollout pct from 0% → 1% (one-character config change) | Launch lead, executed by eng on-call |
| T +5 | Confirm first real verifications flowing to new vendor in audit log | Eng on-call |
| T +15 | Sanity check: completion rate stable; no error spike | Launch lead |
| T +60 | First hourly check-in. All-green or define action. | All |
| T +4h | Mid-day review. Manual review queue, support volume, decision distribution. | All |
| T +8h | End-of-day review. Note observations for Day +1. | Launch lead |
What to watch in the first hour
- First verification flowing. If no traffic appears within 10 minutes of the flip, you have a routing bug. Roll back, investigate.
- Decision mix vs baseline. The first 50 decisions should look like the population baseline. A 20-point skew toward decline on the first day is a calibration issue.
- Latency. p95 should match shadow-mode observations. Big regression means infrastructure or vendor issue.
- Webhook delivery. Watch the DLQ; should stay at zero. Anything else is a config problem.
- Manual review queue. Check vendor console manually; confirm reviewers can see & action items.
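The decision-mix check lends itself to a simple automated skew comparison against the shadow-mode baseline. A minimal sketch — the baseline proportions and category names are assumptions to replace with your own, and the 20-point threshold mirrors the rule of thumb above:

```python
from collections import Counter

# Assumed 7d shadow-mode baseline; substitute your measured proportions.
BASELINE = {"approve": 0.78, "decline": 0.12, "review": 0.08, "indeterminate": 0.02}

def decision_skew(decisions: list[str]) -> dict[str, float]:
    """Percentage-point delta per category vs the shadow-mode baseline."""
    counts = Counter(decisions)
    total = len(decisions)
    return {
        cat: 100 * (counts.get(cat, 0) / total - base)
        for cat, base in BASELINE.items()
    }

def should_page(decisions: list[str], threshold_pts: float = 20.0) -> bool:
    """Page the launch lead if any category drifts beyond the threshold."""
    return any(abs(d) >= threshold_pts for d in decision_skew(decisions).values())

# First 50 decisions heavily skewed toward decline → page.
skewed = ["decline"] * 25 + ["approve"] * 25
assert should_page(skewed)

# Mix near baseline → quiet.
normal = ["approve"] * 39 + ["decline"] * 6 + ["review"] * 4 + ["indeterminate"]
assert not should_page(normal)
```

At 50 decisions this is a smoke test, not statistics — small samples are noisy, so treat a breach as a prompt to look at the raw decisions, not an automatic rollback.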
The temptation to ramp aggressively when things look fine is real, and wrong. The 1% threshold catches issues that don't appear at lower volumes; the 10% threshold catches issues that don't appear at 1%. Give each stage its planned window. The launch is not over until Day +14.
Day +1 — First-day-after
- Overnight metrics reviewed; no alerts fired
- Cohort review: first 24h of new-vendor users vs control. Completion, false-reject, latency.
- Compliance reviews random sample of 50 decisions; audit trail intact
- Customer support reviews tickets; categorize anything mentioning verification
- Dispute rate compared to baseline
- Spot-check vendor billing: does Day 0 invoice match expected cost?
- Vendor TAM debrief: anything they saw on their side
- Launch-day observation doc written (issues + non-issues for future reference)
Day +7 — End of first week
- 7-day metrics review: completion, false-reject, latency, queue depth, dispute rate
- 1% exit-gate criteria from 05 § Exit criteria all passing
- Go / hold decision on ramp to 10%
- Joint review with vendor — any issues from their side, any roadmap topics
- Compliance review: random sample of 200 decisions; audit complete
- Support ticket trend analysis: any emerging patterns?
- Cost reconciliation: actual unit cost vs forecast for week 1
- Lessons-learned doc updated; shared with broader product / eng team
Day +30 — End of first month
- 30-day metric review: every primary & secondary metric from 05 § Metrics
- RFP-claim vs reality comparison: where did the vendor over-promise? Where did they over-deliver?
- Full first-month invoice reconciliation; surprises documented; finance signed off
- First Quarterly Business Review with vendor scheduled (Day +60 or +90)
- Decommissioning plan for prior vendor finalized (if applicable); data export started
- Any incidents in the period: RCAs complete; preventive actions tracked
- Dashboards refactored to steady-state (less granular than launch-mode)
- Project retro: what would you do differently? Write it down for the next vendor decision.
Three years from now someone else will be re-evaluating this vendor. Your honest retro — what scored well in the RFP that didn't translate, what surprised you in production, what the vendor was actually like to work with — saves them 8 weeks. Treat it as the most valuable thing you produce.
Dashboards on the wall
The dashboards that must exist before Day -7. Keep them simple; complex dashboards hide signal.
Launch-mode dashboard (refresh: 1 min)
- Rollout pct (large number, current value)
- Verifications/min by vendor (line chart, last 4 hours)
- Decision mix by vendor (stacked area, approve / decline / review / indeterminate)
- Completion rate by vendor (line chart, last 24h, with 7d baseline overlay)
- p50 / p95 / p99 latency by vendor (line chart)
- Error rate (5xx, timeout, signature failure) by vendor
- Webhook delivery success %, DLQ depth
- Manual review queue depth + oldest item age
Steady-state dashboard (refresh: 5 min, view weekly)
- Day-over-day completion / false-reject / latency, with anomaly bands.
- Per-country breakdown of key metrics.
- Unit cost trend (cost per successful verification).
- Dispute volume and dispositioning.
- Sanctions hit rate and investigation cycle time.
- Downstream fraud rate on approved users (lagging, weekly).
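The anomaly bands on the day-over-day charts can be as simple as a trailing mean ± k standard deviations. A sketch under assumed parameters — the window length and k = 3 are starting points to tune against your own metric variance:

```python
from statistics import mean, stdev

def anomaly_band(history: list[float], k: float = 3.0) -> tuple[float, float]:
    """Band of mean ± k·stdev over the trailing window (e.g., last 28 days)."""
    mu, sigma = mean(history), stdev(history)
    return (mu - k * sigma, mu + k * sigma)

def is_anomalous(today: float, history: list[float], k: float = 3.0) -> bool:
    lo, hi = anomaly_band(history, k)
    return not (lo <= today <= hi)

# 28 days of completion rate hovering around 91% with ~0.5pt noise.
history = [91.0, 90.5, 91.5, 91.2, 90.8, 91.1, 90.9] * 4
assert not is_anomalous(91.3, history)   # within band
assert is_anomalous(85.0, history)       # a 6-point drop breaches the band
```

Fixed bands miss slow drift and seasonal patterns; if your traffic has a weekly shape, compare against the same weekday's history rather than the raw trailing window.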
Post-launch monitoring rhythm
| Cadence | What | Owner |
|---|---|---|
| Daily | Primary metrics dashboard scan; alerts triage | Eng on-call |
| Weekly | Metric review meeting; cost reconciliation; support-ticket categorization | PM + Eng lead + Customer ops lead |
| Monthly | Compliance audit-trail sample review (100–200 decisions) | MLRO / compliance |
| Monthly | Vendor invoice reconciliation; surprise-line-item triage | Finance + PM |
| Quarterly | QBR with vendor: roadmap, performance vs SLA, escalations, pricing | PM (chair) |
| Annually | Renew-or-re-evaluate decision; refresh scorecard; check market for new entrants | PM + procurement |
When to re-evaluate or switch vendors
You should know your switch criteria before you ever sign. Pre-commit to triggers so the decision is fact-based, not relationship-based.
Hard triggers (start re-evaluation immediately)
- SLA breached > 3 consecutive months.
- A regulator names the vendor in an issue against you or a peer customer.
- Vendor experiences a material security breach involving customer PII.
- Vendor is acquired by a competitor or a strategically misaligned party.
- Pricing increases > cap in the contract, sustained beyond a re-negotiation window.
- Performance regression (completion / false-reject) of > 3 points without correction in 60 days.
Soft triggers (re-evaluate at next renewal)
- Vendor's roadmap diverges from your direction (e.g., you're going crypto-native, they're going retail-banking).
- Support quality degrades materially.
- Cost benchmarks against the market reveal you're 30%+ above market for comparable service.
- A new entrant has materially better performance on your population.
- You've outgrown the vendor's enterprise readiness (e.g., they don't support your new geos).
The annual decision
Every renewal cycle, refresh the scorecard against current vendor performance and current alternatives. Treat the incumbent as a candidate, not a default. Score them with the same rubric; surface the gap to senior leadership. Most of the time the incumbent wins — but the exercise keeps them honest, and keeps you ready to move if needed.
That's the whole playbook. Scorecard → RFP → DD → orchestration → pilot → runbook. From "we need an IDV vendor" to "we have a signed contract, a live integration, a healthy production deployment, and a plan for switching when the time comes." Now go run it.