Go-Live Runbook
The literal checklist for flipping the switch. Day -7 through Day +30. On-call rotation, dashboards, rollback triggers and procedure, and the post-launch monitoring rhythm.
On-call & comms setup
Set this up the week before, not the morning of. Have a named person at every position; an "owner: TBD" line on launch day is how launches fail.
The launch roster
| Role | Who | Primary responsibility |
|---|---|---|
| Launch lead | PM owning the project | Final call on go / no-go / rollback. Single decision-maker. |
| Tech on-call (primary) | Eng lead on integration | Triage technical issues; can flip rollout pct |
| Tech on-call (secondary) | Backup engineer | Take over if primary unavailable; ≤30 min response |
| Compliance on-call | MLRO / CCO delegate | Audit-trail review; regulator-facing if needed |
| Customer ops lead | Support manager | Surface user signals not in metrics; staff queue |
| Vendor TAM | Vendor's account-side person | Backchannel to vendor engineering during incident |
| Comms / PR | Comms lead | External messaging if launch becomes user-visible |
Channels
- Working channel (e.g., #idv-launch): everyone above + observers. All decisions logged here.
- Vendor shared channel (Slack Connect / Teams): direct line to vendor engineering.
- Incident channel: existing #incidents or equivalent for P1.
- Decision log: pinned doc, one row per decision (timestamp, decision, decider, rationale).
Pin the exact command / config-flag flip to roll back to 0% at the top of the launch channel. If you're improvising it at 3 AM, you've lost.
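What gets pinned depends entirely on your flag service, but it's worth dry-run testing the exact flip in advance. A minimal sketch, assuming a hypothetical internal feature-flag HTTP API — `FLAG_SERVICE_URL` and the flag name `idv_vendor_rollout_pct` are placeholders, substitute your real endpoint and flag:

```python
import json
import urllib.request

# Hypothetical internal flag service; substitute your real endpoint and flag name.
FLAG_SERVICE_URL = "https://flags.internal.example.com/api/flags"
FLAG_NAME = "idv_vendor_rollout_pct"

def build_rollback_request(actor: str, reason: str) -> urllib.request.Request:
    """Construct the request that sets rollout to 0%. Kept as a builder so
    the exact request can be reviewed and dry-run tested before launch day."""
    payload = {"value": 0, "actor": actor, "reason": reason}
    return urllib.request.Request(
        url=f"{FLAG_SERVICE_URL}/{FLAG_NAME}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Dry-run check: the same request you'd send at 3 AM, verified in advance.
req = build_rollback_request(actor="eng-oncall", reason="launch rollback")
assert json.loads(req.data)["value"] == 0
assert req.get_method() == "PUT"
```

The point is not this particular shape — it's that the rollback is a single pre-built, pre-tested artifact, not something composed from memory mid-incident.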
Day -7 — Pre-launch checklist
One week out. Everything below should be done before Day 0 morning. If anything's still red by Day -3, slip the launch date.
Technical readiness
- Vendor integration in production environment, dark-launched for ≥ 7 days
- Vendor-agnostic abstraction layer in place (no vendor SDK leak into domain code)
- Rollout config flag wired to feature-flag service, tested both directions
- Webhook endpoint receiving production webhooks, signed-payload verification working, idempotent
- Webhook dead-letter queue + alerting on DLQ depth
- Verification request idempotency keys deployed; duplicate-submission test passes
- Audit-log schema deployed; sample 100 records reviewed by compliance
- All dashboards from §dashboards live and verified against test traffic
- Alerts wired to PagerDuty / equivalent with thresholds calibrated against shadow-mode data
- Rollback flip tested in staging end-to-end (≤ 5 min, no in-flight verification corruption)
- Load tested at 3× expected peak; vendor SLA confirmed under load
- Error paths tested: vendor 5xx, timeout, malformed response, invalid signature
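The signed-payload verification and idempotent handling above can be sketched with stdlib primitives. This assumes the vendor signs the raw request body with HMAC-SHA256 and sends the hex digest in a header — check your vendor's actual scheme (header name, encoding, timestamp binding) before relying on it:

```python
import hmac
import hashlib

WEBHOOK_SECRET = b"shared-secret-from-vendor-console"  # placeholder

def verify_signature(raw_body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison of the vendor's HMAC-SHA256 signature."""
    expected = hmac.new(WEBHOOK_SECRET, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

_seen_event_ids: set[str] = set()  # use a persistent store in production

def handle_webhook(event_id: str, raw_body: bytes, signature_hex: str) -> str:
    """Reject bad signatures; process each event id at most once."""
    if not verify_signature(raw_body, signature_hex):
        return "rejected"      # alert if invalid-signature rate is nonzero
    if event_id in _seen_event_ids:
        return "duplicate"     # safe no-op; vendor retries make these normal
    _seen_event_ids.add(event_id)
    return "processed"

body = b'{"verification_id": "v_123", "decision": "approve"}'
sig = hmac.new(WEBHOOK_SECRET, body, hashlib.sha256).hexdigest()
assert handle_webhook("evt_1", body, sig) == "processed"
assert handle_webhook("evt_1", body, sig) == "duplicate"   # idempotent replay
assert handle_webhook("evt_2", body, "bad-signature") == "rejected"
```

Note that `compare_digest` (not `==`) is what makes the check resistant to timing attacks, and that the duplicate path must be a silent success, not an error — vendors retry aggressively.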
Compliance & ops readiness
- MLRO / CCO sign-off on the verification flow and audit trail
- DPA + MSA + Order Form fully executed, copies stored in contract repository
- Public sub-processor list updated to include new vendor (where required)
- User-facing privacy notice updated; legal sign-off on copy
- Retention policies configured on vendor side per jurisdiction
- Manual review team trained on vendor console; access provisioned via SSO
- Customer support trained on the new flow + dispute process + escalation
- Internal KB articles updated: how to dispute a decline, common error messages, escalation routing
- User-facing dispute / appeal mechanism live and end-to-end tested
- Regulator notifications sent (where the rules require informing them of vendor change)
Day -1 — Final pre-flight
- Go / no-go meeting at end of day. Required attendees: Launch lead, Eng on-call, Compliance, Customer ops, Vendor TAM
- Baseline metrics captured: completion rate, false-reject, latency, manual-review queue depth (last 7d)
- Rollout pct currently 0; confirmed in feature-flag UI by two people
- Vendor TAM confirmed on standby; their engineering team has launch-window awareness
- Vendor's status page checked; no in-flight incidents
- On-call roster confirmed; PagerDuty schedules verified; backup paths tested
- Rollback command pinned in launch channel
- Code freeze on the verification stack from end of Day -1 through end of Day +1
- All dashboards opened in browser tabs, one set per on-call person
Day 0 — Launch day
Flip the switch. Watch. Don't multitask. The whole on-call rotation is on this and only this.
Timeline (typical morning launch)
| Time | Action | Owner |
|---|---|---|
| T-30 min | Roster gathers in launch channel; final go/no-go | Launch lead |
| T-15 min | Snapshot all dashboards; baseline confirmed | Eng on-call |
| T 0 | Set rollout pct from 0% → 1% (one-character config change) | Launch lead, executed by eng on-call |
| T +5 | Confirm first real verifications flowing to new vendor in audit log | Eng on-call |
| T +15 | Sanity check: completion rate stable; no error spike | Launch lead |
| T +60 | First hourly check-in. All-green or define action. | All |
| T +4h | Mid-day review. Manual review queue, support volume, decision distribution. | All |
| T +8h | End-of-day review. Note observations for Day +1. | Launch lead |
What to watch in the first hour
- First verification flowing. If no traffic appears within 10 minutes of the flip, you have a routing bug. Roll back, investigate.
- Decision mix vs baseline. The first 50 decisions should look like the population baseline. A 20-point skew toward decline on the first day is a calibration issue.
- Latency. p95 should match shadow-mode observations. Big regression means infrastructure or vendor issue.
- Webhook delivery. Watch the DLQ; should stay at zero. Anything else is a config problem.
- Manual review queue. Check vendor console manually; confirm reviewers can see & action items.
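The decision-mix check lends itself to a simple automated skew comparison against the shadow-mode baseline. A minimal sketch — the baseline proportions and category names are assumptions to replace with your own, and the 20-point threshold mirrors the rule of thumb above:

```python
from collections import Counter

# Assumed 7d shadow-mode baseline; substitute your measured proportions.
BASELINE = {"approve": 0.78, "decline": 0.12, "review": 0.08, "indeterminate": 0.02}

def decision_skew(decisions: list[str]) -> dict[str, float]:
    """Percentage-point delta per category vs the shadow-mode baseline."""
    counts = Counter(decisions)
    total = len(decisions)
    return {
        cat: 100 * (counts.get(cat, 0) / total - base)
        for cat, base in BASELINE.items()
    }

def should_page(decisions: list[str], threshold_pts: float = 20.0) -> bool:
    """Page the launch lead if any category drifts beyond the threshold."""
    return any(abs(d) >= threshold_pts for d in decision_skew(decisions).values())

# First 50 decisions heavily skewed toward decline → page.
skewed = ["decline"] * 25 + ["approve"] * 25
assert should_page(skewed)

# Mix near baseline → quiet.
normal = ["approve"] * 39 + ["decline"] * 6 + ["review"] * 4 + ["indeterminate"]
assert not should_page(normal)
```

At 50 decisions this is a smoke test, not statistics — small samples are noisy, so treat a breach as a prompt to look at the raw decisions, not an automatic rollback.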
The temptation to ramp aggressively when things look fine is real, and wrong. The 1% threshold catches issues that don't appear at lower volumes; the 10% threshold catches issues that don't appear at 1%. Give each stage its planned window. The launch is not over until Day +14.
Day +1 — First-day-after
- Overnight metrics reviewed; no alerts fired
- Cohort review: first 24h of new-vendor users vs control. Completion, false-reject, latency.
- Compliance reviews random sample of 50 decisions; audit trail intact
- Customer support reviews tickets; categorize anything mentioning verification
- Dispute rate compared to baseline
- Spot-check vendor billing: does Day 0 invoice match expected cost?
- Vendor TAM debrief: anything they saw on their side
- Launch-day observation doc written (issues + non-issues for future reference)
Day +7 — End of first week
- 7-day metrics review: completion, false-reject, latency, queue depth, dispute rate
- 1% exit-gate criteria from 05 § Exit criteria all passing
- Go / hold decision on ramp to 10%
- Joint review with vendor — any issues from their side, any roadmap topics
- Compliance review: random sample of 200 decisions; audit complete
- Support ticket trend analysis: any emerging patterns?
- Cost reconciliation: actual unit cost vs forecast for week 1
- Lessons-learned doc updated; shared with broader product / eng team
Day +30 — End of first month
- 30-day metric review: every primary & secondary metric from 05 § Metrics
- RFP-claim vs reality comparison: where did the vendor over-promise? Where did they over-deliver?
- Full first-month invoice reconciliation; surprises documented; finance signed off
- First Quarterly Business Review with vendor scheduled (Day +60 or +90)
- Decommissioning plan for prior vendor finalized (if applicable); data export started
- Any incidents in the period: RCAs complete; preventive actions tracked
- Dashboards refactored to steady-state (less granular than launch-mode)
- Project retro: what would you do differently? Write it down for the next vendor decision.
Three years from now someone else will be re-evaluating this vendor. Your honest retro — what scored well in the RFP that didn't translate, what surprised you in production, what the vendor was actually like to work with — saves them 8 weeks. Treat it as the most valuable thing you produce.
Dashboards on the wall
The dashboards that must exist before Day -7. Keep them simple; complex dashboards hide signal.
Launch-mode dashboard (refresh: 1 min)
- Rollout pct (large number, current value)
- Verifications/min by vendor (line chart, last 4 hours)
- Decision mix by vendor (stacked area, approve / decline / review / indeterminate)
- Completion rate by vendor (line chart, last 24h, with 7d baseline overlay)
- p50 / p95 / p99 latency by vendor (line chart)
- Error rate (5xx, timeout, signature failure) by vendor
- Webhook delivery success %, DLQ depth
- Manual review queue depth + oldest item age
Steady-state dashboard (refresh: 5 min, view weekly)
- Day-over-day completion / false-reject / latency, with anomaly bands.
- Per-country breakdown of key metrics.
- Unit cost trend (cost per successful verification).
- Dispute volume and dispositioning.
- Sanctions hit rate and investigation cycle time.
- Downstream fraud rate on approved users (lagging, weekly).
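The anomaly bands on the day-over-day charts can be as simple as a trailing mean ± k standard deviations. A sketch under assumed parameters — the window length and k = 3 are starting points to tune against your own metric variance:

```python
from statistics import mean, stdev

def anomaly_band(history: list[float], k: float = 3.0) -> tuple[float, float]:
    """Band of mean ± k·stdev over the trailing window (e.g., last 28 days)."""
    mu, sigma = mean(history), stdev(history)
    return (mu - k * sigma, mu + k * sigma)

def is_anomalous(today: float, history: list[float], k: float = 3.0) -> bool:
    lo, hi = anomaly_band(history, k)
    return not (lo <= today <= hi)

# 28 days of completion rate hovering around 91% with ~0.5pt noise.
history = [91.0, 90.5, 91.5, 91.2, 90.8, 91.1, 90.9] * 4
assert not is_anomalous(91.3, history)   # within band
assert is_anomalous(85.0, history)       # a 6-point drop breaches the band
```

Fixed bands miss slow drift and seasonal patterns; if your traffic has a weekly shape, compare against the same weekday's history rather than the raw trailing window.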
Post-launch monitoring rhythm
| Cadence | What | Owner |
|---|---|---|
| Daily | Primary metrics dashboard scan; alerts triage | Eng on-call |
| Weekly | Metric review meeting; cost reconciliation; support-ticket categorization | PM + Eng lead + Customer ops lead |
| Monthly | Compliance audit-trail sample review (100–200 decisions) | MLRO / compliance |
| Monthly | Vendor invoice reconciliation; surprise-line-item triage | Finance + PM |
| Quarterly | QBR with vendor: roadmap, performance vs SLA, escalations, pricing | PM (chair) |
| Annually | Renew-or-re-evaluate decision; refresh scorecard; check market for new entrants | PM + procurement |
When to re-evaluate or switch vendors
You should know your switch criteria before you ever sign. Pre-commit to triggers so the decision is fact-based, not relationship-based.
Hard triggers (start re-evaluation immediately)
- SLA breached > 3 consecutive months.
- A regulator names the vendor in an issue against you or a peer customer.
- Vendor experiences a material security breach involving customer PII.
- Vendor is acquired by a competitor or a strategically misaligned party.
- Pricing increases > cap in the contract, sustained beyond a re-negotiation window.
- Performance regression (completion / false-reject) of > 3 points without correction in 60 days.
Soft triggers (re-evaluate at next renewal)
- Vendor's roadmap diverges from your direction (e.g., you're going crypto-native, they're going retail-banking).
- Support quality degrades materially.
- Cost benchmarks against the market reveal you're 30%+ above market for comparable service.
- A new entrant has materially better performance on your population.
- You've outgrown the vendor's enterprise readiness (e.g., they don't support your new geos).
The annual decision
Every renewal cycle, refresh the scorecard against current vendor performance and current alternatives. Treat the incumbent as a candidate, not a default. Score them with the same rubric; surface the gap to senior leadership. Most of the time the incumbent wins — but the exercise keeps them honest, and keeps you ready to move if needed.
That's the whole playbook. Scorecard → RFP → DD → orchestration → pilot → runbook. From "we need an IDV vendor" to "we have a signed contract, a live integration, a healthy production deployment, and a plan for switching when the time comes." Now go run it.