Chapter 06 · Working artifact

Go-Live Runbook

The literal checklist for flipping the switch. Day -7 through Day +30. On-call rotation, dashboards, rollback triggers and procedure, and the post-launch monitoring rhythm.

On-call & comms setup

Set this up the week before, not the morning of. Have a named person at every position; an "owner: TBD" line on launch day is how launches fail.

The launch roster

| Role | Who | Primary responsibility |
| --- | --- | --- |
| Launch lead | PM owning the project | Final call on go / no-go / rollback. Single decision-maker. |
| Tech on-call (primary) | Eng lead on integration | Triage technical issues; can flip rollout pct |
| Tech on-call (secondary) | Backup engineer | Take over if primary unavailable; ≤ 30 min response |
| Compliance on-call | MLRO / CCO delegate | Audit-trail review; regulator-facing if needed |
| Customer ops lead | Support manager | Surface user signals not in metrics; staff the queue |
| Vendor TAM | Vendor's account-side person | Backchannel to vendor engineering during an incident |
| Comms / PR | Comms lead | External messaging if the launch becomes user-visible |

Channels

  • Working channel (e.g., #idv-launch): everyone above + observers. All decisions logged here.
  • Vendor shared channel (Slack Connect / Teams): direct line to vendor engineering.
  • Incident channel: existing #incidents or equivalent for P1.
  • Decision log: pinned doc, one row per decision (timestamp, decision, decider, rationale).
Pin the rollback command at the top

Pin the exact command / config-flag flip to roll back to 0% at the top of the launch channel. If you're improvising it at 3 AM, you've lost.
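
For illustration, here is what the pinned rollback might look like as a script rather than a chat message. This is a minimal sketch assuming a hypothetical internal feature-flag HTTP API; the URL, flag name, and payload shape are placeholders for whatever flag service you actually run.

```python
"""rollback_idv.py -- pin this (or its one-liner equivalent) in the launch channel."""
import os

import requests

# Hypothetical internal flag API; substitute your real feature-flag service.
FLAG_URL = "https://flags.internal.example.com/api/v1/flags/idv-vendor-rollout"
TOKEN = os.environ["FLAG_API_TOKEN"]  # provisioned before launch, not at 3 AM


def rollback() -> None:
    """Set the new-vendor rollout percentage back to 0. Safe to run twice."""
    resp = requests.patch(
        FLAG_URL,
        json={"rollout_pct": 0, "reason": "launch rollback"},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    print("rollout_pct now:", resp.json().get("rollout_pct"))


if __name__ == "__main__":
    rollback()
```

Whatever the real mechanism is, the Day -7 checklist below requires testing this exact artifact end-to-end in staging; confirm at least two people can run it.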

Day -7 — Pre-launch checklist

One week out. Everything below should be done before Day 0 morning. If anything's still red on Day -3, push the launch.

Technical readiness

  • Vendor integration in production environment, dark-launched for ≥ 7 days
  • Vendor-agnostic abstraction layer in place (no vendor SDK leak into domain code)
  • Rollout config flag wired to feature-flag service, tested both directions
  • Webhook endpoint receiving production webhooks, signed-payload verification working, idempotent (see the sketch after this list)
  • Webhook dead-letter queue + alerting on DLQ depth
  • Verification request idempotency keys deployed; duplicate-submission test passes
  • Audit-log schema deployed; sample 100 records reviewed by compliance
  • All dashboards from "Dashboards on the wall" (below) live and verified against test traffic
  • Alerts wired to PagerDuty / equivalent with thresholds calibrated against shadow-mode data
  • Rollback flip tested in staging end-to-end (≤ 5 min, no in-flight verification corruption)
  • Load tested at 3× expected peak; vendor SLA confirmed under load
  • Error paths tested: vendor 5xx, timeout, malformed response, invalid signature
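
Two items above, signed-payload verification and webhook idempotency, are where launches most often leak. A minimal sketch, assuming an HMAC-SHA256 signature header and an event-ID dedupe; the header scheme, secret handling, and in-memory store are illustrative, not any particular vendor's contract.

```python
import hashlib
import hmac
import os

WEBHOOK_SECRET = os.environ["IDV_WEBHOOK_SECRET"].encode()
_seen_event_ids: set = set()  # use a durable store (Redis / DB) in production


def verify_signature(raw_body: bytes, signature_header: str) -> bool:
    # Recompute the HMAC over the *raw* request body -- never the parsed JSON,
    # which can re-serialize differently and break verification.
    expected = hmac.new(WEBHOOK_SECRET, raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)


def handle_webhook(raw_body: bytes, signature_header: str, event_id: str) -> str:
    if not verify_signature(raw_body, signature_header):
        return "rejected"    # alert on a spike: misconfiguration, or someone probing
    if event_id in _seen_event_ids:
        return "duplicate"   # vendors redeliver; processing twice must be harmless
    _seen_event_ids.add(event_id)
    # ... apply the verification result, write the audit-log record ...
    return "processed"
```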

Compliance & ops readiness

  • MLRO / CCO sign-off on the verification flow and audit trail
  • DPA + MSA + Order Form fully executed, copies stored in contract repository
  • Public sub-processor list updated to include new vendor (where required)
  • User-facing privacy notice updated; legal sign-off on copy
  • Retention policies configured on vendor side per jurisdiction
  • Manual review team trained on vendor console; access provisioned via SSO
  • Customer support trained on the new flow + dispute process + escalation
  • Internal KB articles updated: how to dispute a decline, common error messages, escalation routing
  • User-facing dispute / appeal mechanism live and end-to-end tested
  • Regulator notifications sent (where the rules require informing them of vendor change)

Day -1 — Final pre-flight

  • Go / no-go meeting at end of day. Required attendees: Launch lead, Eng on-call, Compliance, Customer ops, Vendor TAM
  • Baseline metrics captured and frozen to a dated file (see the sketch after this list): completion rate, false-reject, latency, manual-review queue depth (last 7d)
  • Rollout pct currently 0; confirmed in feature-flag UI by two people
  • Vendor TAM confirmed on standby; their engineering team knows the launch window
  • Vendor's status page checked; no in-flight incidents
  • On-call roster confirmed; PagerDuty schedules verified; backup paths tested
  • Rollback command pinned in launch channel
  • Code freeze on the verification stack from end of Day -1 through end of Day +1
  • All dashboards opened in browser tabs, one set per on-call person
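
For the baseline capture, freeze the numbers to a file the whole roster can link to, rather than trusting memory of a dashboard. A sketch; the query function is a placeholder for your metrics store, and the field names are illustrative.

```python
import json
from datetime import date


def fetch_last_7d_aggregates() -> dict:
    # Placeholder: pull these from your metrics store / warehouse.
    # The values below show the shape, not real numbers.
    return {
        "completion_rate": 0.93,
        "false_reject_rate": 0.021,
        "p95_latency_ms": 1900,
        "manual_review_queue_depth": 140,
    }


baseline = {"captured": date.today().isoformat(), **fetch_last_7d_aggregates()}
path = f"baseline-{baseline['captured']}.json"
with open(path, "w") as f:
    json.dump(baseline, f, indent=2)
print("baseline frozen to", path, "-- pin it in the launch channel")
```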

Day 0 — Launch day

Flip the switch. Watch. Don't multitask. The whole on-call rotation is on this and only this.

Timeline (typical morning launch)

| Time | Action | Owner |
| --- | --- | --- |
| T-30 min | Roster gathers in launch channel; final go / no-go | Launch lead |
| T-15 min | Snapshot all dashboards; baseline confirmed | Eng on-call |
| T 0 | Set rollout pct from 0 → 1 (one-character config change) | Launch lead (executed by eng on-call) |
| T +5 min | Confirm first real verifications flowing to the new vendor in the audit log | Eng on-call |
| T +15 min | Sanity check: completion rate stable; no error spike | Launch lead |
| T +60 min | First hourly check-in: all green, or define an action | All |
| T +4 h | Mid-day review: manual-review queue, support volume, decision distribution | All |
| T +8 h | End-of-day review; note observations for Day +1 | Launch lead |

What to watch in the first hour

  1. First verification flowing. If no traffic appears within 10 minutes of the flip, you have a routing bug. Roll back, investigate.
  2. Decision mix vs baseline. The first 50 decisions should look like the population baseline. A 20-point skew toward decline on the first day is a calibration issue (a quick check is sketched after this list).
  3. Latency. p95 should match shadow-mode observations. Big regression means infrastructure or vendor issue.
  4. Webhook delivery. Watch the DLQ; should stay at zero. Anything else is a config problem.
  5. Manual review queue. Check vendor console manually; confirm reviewers can see & action items.
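
The decision-mix check in item 2 is easy to eyeball wrong on 50 decisions. A quick sketch of the comparison; the baseline mix and cohort below are made up, and only the 20-point threshold comes from the rule of thumb above.

```python
from collections import Counter

BASELINE_MIX = {"approve": 0.78, "decline": 0.10, "review": 0.12}  # last-7d shadow mode
SKEW_ALERT_POINTS = 20  # per the rule of thumb in item 2


def decision_mix_skew(decisions: list) -> dict:
    """Percentage-point delta between the observed mix and baseline, per outcome."""
    counts = Counter(decisions)
    total = len(decisions)
    return {
        outcome: 100 * (counts.get(outcome, 0) / total - expected)
        for outcome, expected in BASELINE_MIX.items()
    }


# First 50 new-vendor decisions pulled from the audit log (illustrative):
first_50 = ["approve"] * 28 + ["decline"] * 15 + ["review"] * 7
for outcome, delta in decision_mix_skew(first_50).items():
    flag = "  <-- investigate; consider rollback" if abs(delta) >= SKEW_ALERT_POINTS else ""
    print(f"{outcome}: {delta:+.1f} pts{flag}")
```

With only 50 decisions, a few points of noise is expected; the threshold is deliberately coarse so it only fires on genuine calibration problems.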
Don't go to 10% on Day 0

The temptation to ramp aggressively when things look fine is real, and wrong. The 1% stage catches issues that never showed up in shadow mode; the 10% stage catches issues that don't appear at 1%. Give each stage its planned window. The launch is not over until Day +14.

Day +1 — First day after launch

  • Overnight metrics reviewed; no alerts fired
  • Cohort review: first 24h of new-vendor users vs control. Completion, false-reject, latency.
  • Compliance reviews random sample of 50 decisions; audit trail intact
  • Customer support reviews tickets; categorize anything mentioning verification
  • Dispute rate compared to baseline
  • Spot-check vendor billing: does the Day 0 invoice match expected cost?
  • Vendor TAM debrief: anything they saw on their side
  • Launch-day observation doc written (issues + non-issues for future reference)

Day +7 — End of first week

  • 7-day metrics review: completion, false-reject, latency, queue depth, dispute rate
  • All 1% exit-gate criteria (05 § Exit criteria) passing
  • Go / hold decision on ramp to 10%
  • Joint review with vendor — any issues from their side, any roadmap topics
  • Compliance review: random sample of 200 decisions; audit complete
  • Support ticket trend analysis: any emerging patterns?
  • Cost reconciliation: actual unit cost vs forecast for week 1
  • Lessons-learned doc updated; shared with broader product / eng team

Day +30 — End of first month

  • 30-day metric review: every primary & secondary metric from 05 § Metrics
  • RFP-claim vs reality comparison: where did the vendor over-promise? Where did they over-deliver?
  • Full first-month invoice reconciliation; surprises documented; finance signed off
  • First Quarterly Business Review (QBR) with vendor scheduled for Day +60 or +90
  • Decommissioning plan for prior vendor finalized (if applicable); data export started
  • Any incidents in the period: RCAs complete; preventive actions tracked
  • Dashboards refactored to steady-state (less granular than launch-mode)
  • Project retro: what would you do differently? Write it down for the next vendor decision.
The retro is the artifact your successor will thank you for

Three years from now someone else will be re-evaluating this vendor. Your honest retro — what scored well in the RFP that didn't translate, what surprised you in production, what the vendor was actually like to work with — saves them 8 weeks. Treat it as the most valuable thing you produce.

Dashboards on the wall

The dashboards that must exist before Day -7. Keep them simple; complex dashboards hide signal.

Launch-mode dashboard (refresh: 1 min)

  • Rollout pct (large number, current value)
  • Verifications/min by vendor (line chart, last 4 hours)
  • Decision mix by vendor (stacked area, approve / decline / review / indeterminate)
  • Completion rate by vendor (line chart, last 24h, with 7d baseline overlay)
  • p50 / p95 / p99 latency by vendor (line chart)
  • Error rate (5xx, timeout, signature failure) by vendor
  • Webhook delivery success %, DLQ depth
  • Manual review queue depth + oldest item age
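
The thresholds wired behind these panels matter as much as the panels. Here is a sketch of launch-mode alert rules expressed as data and evaluated against a metrics snapshot; every number is a placeholder to be calibrated against your shadow-mode data, per the Day -7 checklist.

```python
# Launch-mode alert rules as data: (description, predicate over a metrics snapshot).
# Every threshold below is a placeholder -- calibrate against shadow-mode data.
LAUNCH_ALERTS = [
    ("webhook DLQ is non-empty",          lambda m: m["dlq_depth"] > 0),
    ("p95 latency regressed > 50%",       lambda m: m["p95_ms"] > 1.5 * m["baseline_p95_ms"]),
    ("error rate above 1%",               lambda m: m["error_rate"] > 0.01),
    ("completion down > 5 pts vs 7d",     lambda m: m["completion"] < m["baseline_completion"] - 0.05),
    ("oldest review item older than 4 h", lambda m: m["oldest_review_age_h"] > 4),
]


def firing(metrics: dict) -> list:
    return [desc for desc, breached in LAUNCH_ALERTS if breached(metrics)]


snapshot = {
    "dlq_depth": 0, "p95_ms": 2100, "baseline_p95_ms": 1900,
    "error_rate": 0.004, "completion": 0.91, "baseline_completion": 0.93,
    "oldest_review_age_h": 1.2,
}
print(firing(snapshot) or "all green")
```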

Steady-state dashboard (refresh: 5 min, view weekly)

  • Day-over-day completion / false-reject / latency, with anomaly bands.
  • Per-country breakdown of key metrics.
  • Unit cost trend (cost per successful verification).
  • Dispute volume and dispositioning.
  • Sanctions hit rate and investigation cycle time.
  • Downstream fraud rate on approved users (lagging, weekly).

Post-launch monitoring rhythm

| Cadence | What | Owner |
| --- | --- | --- |
| Daily | Primary-metrics dashboard scan; alert triage | Eng on-call |
| Weekly | Metric review meeting; cost reconciliation; support-ticket categorization | PM + eng lead + customer ops lead |
| Monthly | Compliance audit-trail sample review (100–200 decisions) | MLRO / compliance |
| Monthly | Vendor invoice reconciliation; surprise-line-item triage | Finance + PM |
| Quarterly | QBR with vendor: roadmap, performance vs SLA, escalations, pricing | PM (chair) |
| Annually | Renew-or-re-evaluate decision; refresh scorecard; scan market for new entrants | PM + procurement |

When to re-evaluate or switch vendors

You should know your switch criteria before you ever sign. Pre-commit to triggers so the decision is fact-based, not relationship-based. A sketch of the triggers as checkable rules follows the soft-trigger list.

Hard triggers (start re-evaluation immediately)

  • SLA breached > 3 consecutive months.
  • A regulator names the vendor in an issue against you or a peer customer.
  • Vendor experiences a material security breach involving customer PII.
  • Vendor is acquired by a competitor or a strategically misaligned party.
  • Pricing increases > cap in the contract, sustained beyond a re-negotiation window.
  • Performance regression (completion / false-reject) of > 3 points without correction in 60 days.

Soft triggers (re-evaluate at next renewal)

  • Vendor's roadmap diverges from your direction (e.g., you're going crypto-native, they're going retail-banking).
  • Support quality degrades materially.
  • Cost benchmarks against the market reveal you're 30%+ above market for comparable service.
  • A new entrant has materially better performance on your population.
  • You've outgrown the vendor's enterprise readiness (e.g., they don't support your new geos).
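
One way to keep the pre-commitment honest: write the triggers down as checkable rules rather than prose, and evaluate them at every monthly and quarterly review. A sketch with thresholds taken from the lists above; the snapshot field names are illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SwitchTrigger:
    name: str
    hard: bool  # hard -> start re-evaluation immediately; soft -> at next renewal
    breached: Callable[[dict], bool]


TRIGGERS = [
    SwitchTrigger("SLA breached > 3 consecutive months", True,
                  lambda v: v["sla_breach_streak_months"] > 3),
    SwitchTrigger("material security breach involving customer PII", True,
                  lambda v: v["pii_breach"]),
    SwitchTrigger("performance regression > 3 pts, uncorrected for 60 days", True,
                  lambda v: v["regression_pts"] > 3 and v["days_uncorrected"] >= 60),
    SwitchTrigger("cost 30%+ above market benchmark", False,
                  lambda v: v["cost_vs_market_pct"] >= 130),
]


def evaluate(snapshot: dict) -> None:
    for t in TRIGGERS:
        if t.breached(snapshot):
            action = "start re-evaluation now" if t.hard else "re-evaluate at next renewal"
            print(f"TRIGGERED: {t.name} -> {action}")


evaluate({"sla_breach_streak_months": 4, "pii_breach": False,
          "regression_pts": 1.0, "days_uncorrected": 0, "cost_vs_market_pct": 118})
```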

The annual decision

Every renewal cycle, refresh the scorecard against current vendor performance and current alternatives. Treat the incumbent as a candidate, not a default. Score them with the same rubric; surface the gap to senior leadership. Most of the time the incumbent wins — but the exercise keeps them honest, and keeps you ready to move if needed.

You're done

That's the whole playbook. Scorecard → RFP → DD → orchestration → pilot → runbook. From "we need an IDV vendor" to "we have a signed contract, a live integration, a healthy production deployment, and a plan for switching when the time comes." Now go run it.