Pilot & Rollout
From sandbox bake-off through to 100% production traffic. Sample sizes, success metrics, the 0% → 1% → 10% → 50% → 100% ramp, the geo-staging plan, and the rollback triggers.
Phase 0 — Sandbox bake-off (pre-signature)
This happens before you sign the contract, during the negotiation phase. It's the single highest-yield activity in the whole evaluation. RFP responses are claims; the bake-off is evidence.
What you run
Take a fixed sample set of synthetic verifications — covering each jurisdiction in scope, each document type you expect, each demographic mix, plus deliberate failure cases — and run it through each shortlisted vendor's sandbox. Score the results yourself.
Sample set composition (200–500 cases total)
| Bucket | What | Approx % |
|---|---|---|
| Happy path | Clean docs, good lighting, expected demographics | 40% |
| Difficult-but-valid | Older docs, glare, partial occlusion, accents, low-end Android phones | 25% |
| Demographic edge cases | Skin-tone diversity, age extremes, transgender (legal name vs current presentation), glasses, beards | 10% |
| Document edge cases | Recently issued docs, soon-to-expire, military ID, refugee travel doc, residence permit | 10% |
| Synthetic fraud (positive) | Known spoofs: printed photos, screen replay, deepfake (where ethically obtainable) | 10% |
| Synthetic invalid (negative) | Cropped docs, blurry, wrong-doc-type, expired | 5% |
Don't try to spoof vendors with real fraud documents. Use vendor-provided test fixtures or publicly published academic datasets. The goal is consistency across vendors, not actually committing fraud. Any synthetic data you generate should be clearly labeled, retained securely, and destroyed at end of evaluation.
What to record per vendor
- Decision (approve / decline / review / indeterminate) for every case.
- Time-to-decision (median, p95).
- Vendor reason codes (do they map sensibly to your taxonomy?).
- Confidence score where available.
- API errors, timeouts, retries.
- Cost per check (compare against quoted pricing).
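The per-vendor recording above collapses into one summary table per vendor. A minimal sketch, assuming each harness result is a dict with hypothetical `decision` / `latency_ms` / `error` / `cost` fields — adapt the names to whatever your test harness actually records:

```python
import math
from statistics import median

def summarize_vendor(results: list[dict]) -> dict:
    """Summarize one vendor's bake-off run.

    Each result is assumed to look like:
      {"decision": "approve", "latency_ms": 1200, "error": False, "cost": 0.42}
    (hypothetical field names -- match them to your harness).
    """
    latencies = sorted(r["latency_ms"] for r in results if not r["error"])
    # Nearest-rank p95 over successful calls.
    p95_idx = max(0, math.ceil(0.95 * len(latencies)) - 1)
    decisions: dict[str, int] = {}
    for r in results:
        decisions[r["decision"]] = decisions.get(r["decision"], 0) + 1
    return {
        "n": len(results),
        "decision_counts": decisions,
        "median_latency_ms": median(latencies),
        "p95_latency_ms": latencies[p95_idx],
        "error_rate": sum(r["error"] for r in results) / len(results),
        "avg_cost": sum(r["cost"] for r in results) / len(results),
    }
```

One summary dict per vendor makes the side-by-side comparison (and the reason-code mapping review) a diff rather than a spreadsheet exercise.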
Engineer time
Allow one engineer-week per vendor for sandbox setup, integration, and running the sample set. Time-to-first-verification is itself a scored criterion — a vendor whose sandbox eats 3 days is signalling integration cost.
What to do if the vendor refuses a structured bake-off
Sales teams sometimes resist structured bake-offs because they expose weaknesses against competitors. If a vendor refuses or insists on running the test themselves with their data: that's a kill-switch trigger (revisit K7 in 01 § Kill switches). The polite framing: "We're not asking for anything more than what we'd do once live; this is our standard pre-signature evaluation."
Phase 1 — Pilot design (post-signature, pre-full-launch)
The pilot is where you stop simulating and start running real production traffic. The goal: validate every assumption from the RFP and sandbox in your live environment, with real users, before you depend on the vendor for revenue.
Pilot structure
| Stage | % traffic | Duration | Purpose |
|---|---|---|---|
| P0 — Internal alpha | 0% | 3–5 days | Employees only; verify integration, webhooks, manual-review console |
| P1 — Closed beta | 0% | 1 week | Allowlist of ~50 friendly users; full flow including support & dispute |
| P2 — Production pilot | 1% | 2 weeks | Stable hash-routed; measure all metrics |
| P3 — Production ramp | 10% | 2 weeks | Scale validates throughput, support load, dispute volume |
| P4 — Production majority | 50% | 2 weeks | Half of traffic; vendor is now load-bearing |
| P5 — Full production | 100% | Steady state | Cutover; old vendor in fallback mode for 30 days |
Each gate must pass exit criteria (below) before advancing. Plan ~9–10 weeks from go-live to 100%. Don't rush it; the only thing more expensive than a slow rollout is a fast one that breaks.
Sample sizes — how long does each stage need?
Sample size depends on what you're trying to detect. The numbers below assume a two-sided α = 0.05, 80% power, and the typical metric ranges for consumer IDV. Use them as a planning floor.
| What you're detecting | Baseline | Detectable diff | Sample size per arm |
|---|---|---|---|
| Completion rate change | 85% | 1 pt (84% vs 85%) | ~25,000 |
| Completion rate change | 85% | 2 pt (83% vs 85%) | ~6,500 |
| False-reject rate change | 2.0% | 0.5 pt | ~50,000 |
| False-reject rate change | 2.0% | 1.0 pt | ~13,000 |
| p95 latency regression | 30s | 5s | ~500 (Wilcoxon) |
| Manual-review rate change | 5.0% | 1 pt | ~7,500 |
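The proportion rows in the table come from the standard two-proportion sample-size calculation; a sketch you can rerun against your own baselines, assuming the normal approximation with z-values hard-coded for the common α/power choices to avoid a scipy dependency. The table's figures sit above the raw formula output, consistent with treating them as planning floors:

```python
import math

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size to distinguish proportions p1 vs p2.

    Normal approximation; two-sided alpha. z-values are hard-coded
    for the usual cases rather than pulling in scipy.
    """
    z_alpha = {0.05: 1.96, 0.01: 2.576}[alpha]
    z_beta = {0.80: 0.8416, 0.90: 1.2816}[power]
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2)

# e.g. completion rate 85% vs 84% (1-pt diff) -> ~20.6k per arm raw;
# the table rounds up to ~25k as headroom.
```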
How to plan
- Estimate your daily verification volume.
- Decide which metrics are deal-breakers (typically: completion rate, false-reject rate).
- Compute the days at each % level needed to hit stat-sig on each metric.
- Pick the longer of: (sample-size duration) and (operational-load duration — i.e., long enough that support/dispute volume stabilizes).
If you can hit stat-sig in 3 days but a support-load spike could take 10 days to surface, run 14 days minimum at each gate. Stat-sig is a floor, not a ceiling.
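The planning steps above reduce to one calculation per metric and stage. A sketch, assuming a hypothetical 14-day operational floor as the default:

```python
import math

def days_at_stage(daily_volume: int, rollout_pct: float, n_needed: int,
                  min_operational_days: int = 14) -> int:
    """Days a stage must run: the later of stat-sig and operational stabilization.

    daily_volume: total verifications/day across all vendors.
    rollout_pct: 0-100, share of traffic on the new vendor at this stage.
    n_needed: per-arm sample size for your deal-breaker metric.
    """
    daily_in_arm = daily_volume * rollout_pct / 100.0
    stat_sig_days = math.ceil(n_needed / daily_in_arm)
    return max(stat_sig_days, min_operational_days)

# e.g. 5,000 verifications/day at the 10% stage -> 500/day on the new vendor.
# Detecting a 2-pt completion drop (~6,500 per arm) needs 13 days of data,
# so the 14-day operational floor dominates.
```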
What to measure
Build the dashboards in 04 § Observability before launch. These are the metrics that gate progression at each stage.
Primary metrics (deal-breakers)
| Metric | Definition | Pilot threshold (typical) |
|---|---|---|
| Completion rate | started → verified, of users who initiated | ≥ baseline – 2 pts |
| Auto-approval rate | approved without manual review, of verified | ≥ baseline – 3 pts |
| False-reject rate | rejected users who pass dispute/appeal, of rejected | ≤ baseline + 0.5 pt |
| Median time-to-decision | full path: started → final decision | ≤ 60s for auto-decisions |
| p95 time-to-decision | same, p95 | ≤ 2.5 min for auto-decisions |
| Manual-review queue depth at peak | open reviews waiting > 1 hour | Below your ops capacity |
Secondary metrics (watch list)
- Dispute rate — users contesting a decline.
- NPS / CSAT on the verification flow (in-product survey).
- Cost per successful verification (blended unit cost / completion rate).
- Sanctions hit rate and hit-investigation throughput.
- Webhook delivery success rate (≥ 99.9% required).
- API error rate by vendor 5xx / 4xx / timeout.
- Geographic distribution of decline rate — surface jurisdiction-specific regressions.
Tertiary metrics (post-launch monitoring)
- Downstream fraud rate on approved users (lagging indicator; needs 30–90 days).
- Re-verification trigger rate.
- Cohort-aged unit economics.
- Vendor invoice variance vs forecast.
Pilot exit criteria — gates between stages
Each gate is a hard yes/no. Fail any criterion at a gate and you stay at that % level until it's resolved. Don't soften the gates under pressure; that's how bad launches happen.
- Completion rate ≥ pre-pilot baseline – 2 pts
- False-reject rate ≤ pre-pilot baseline + 0.5 pt
- p95 time-to-decision ≤ 2.5 min for auto-decisions
- Webhook delivery ≥ 99.9% during measurement window
- Vendor API error rate ≤ 1% on 5-min windows, no sustained spikes
- Manual review queue did not exceed 4× baseline peak depth
- Support ticket volume on verification topics within 1.5× baseline
- Dispute rate within 1.5× baseline
- No vendor SLA breach during stage
- No P1 incidents attributable to the integration
- Compliance / MLRO sign-off on the audit trail review (random sample of 100 decisions)
- Cost per successful verification within 10% of forecast
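The numeric criteria above can be checked mechanically; the sign-off items (SLA breaches, P1 incidents, MLRO review) stay human checkboxes. A sketch with illustrative metric field names — wire these to whatever your dashboards actually export:

```python
def gate_passes(metrics: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Evaluate the numeric gate criteria; returns (pass, failed criteria).

    Field names are illustrative. Rates are fractions (0.85, not 85),
    p95_decision_s is seconds, queue/ticket figures are counts.
    """
    checks = [
        ("completion_rate",
         metrics["completion_rate"] >= baseline["completion_rate"] - 0.02),
        ("false_reject_rate",
         metrics["false_reject_rate"] <= baseline["false_reject_rate"] + 0.005),
        ("p95_decision_s", metrics["p95_decision_s"] <= 150),   # 2.5 min
        ("webhook_delivery", metrics["webhook_delivery"] >= 0.999),
        ("api_error_rate", metrics["api_error_rate"] <= 0.01),
        ("review_queue_peak",
         metrics["review_queue_peak"] <= 4 * baseline["review_queue_peak"]),
        ("support_tickets",
         metrics["support_tickets"] <= 1.5 * baseline["support_tickets"]),
        ("dispute_rate",
         metrics["dispute_rate"] <= 1.5 * baseline["dispute_rate"]),
        ("cost_per_verification",
         metrics["cost_per_verification"] <= 1.10 * baseline["cost_per_verification"]),
    ]
    failures = [name for name, ok in checks if not ok]
    return (not failures), failures
```

Encoding the gate in code removes the "it's only slightly below threshold" conversation: the function either returns an empty failure list or it doesn't.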
Phased rollout: 0% → 1% → 10% → 50% → 100%
How to assign users to stages
Use a deterministic hash of user_id mod 10,000 — gives you fine-grained control. Sticky assignment is mandatory — a user who starts on the new vendor stays on it, even if their verification fails and they retry.
```python
import hashlib

def route_vendor(user_id: str, rollout_pct: float) -> str:
    """
    rollout_pct: 0.0-100.0, what % of users should go to the new vendor.
    Deterministic: the same user always gets the same answer for the same
    rollout_pct -- which is what makes assignment sticky.
    """
    h = int(hashlib.sha256(f"vendor-rollout:{user_id}".encode()).hexdigest()[:8], 16)
    bucket = (h % 10000) / 100.0  # 0.00-99.99
    return "vendor_new" if bucket < rollout_pct else "vendor_old"

# Wired to LaunchDarkly / Statsig / your config service:
ROLLOUT_PCT = config.get("idv.new_vendor.pct", default=0.0)
```

Stage-by-stage actions
| Stage | Eng actions | Ops actions | Comms actions |
|---|---|---|---|
| 0% (dark launch) | Shadow-mode: call both vendors, compare results, never use new vendor's decision | Manual review team trained on the new console | Internal-only; no user-facing change |
| 1% | Real decisions for 1% of new users | Daily check-in on manual review queue | None user-facing |
| 10% | No code changes; observe | Twice-weekly check-in; review 100 random decisions | Optional: in-product banner if flow visually changes |
| 50% | Confirm cost forecast holds; tune any rules | Weekly review | Customer support brief on differences (if any) |
| 100% | Old vendor in fallback / write-only for 30 days | Decommission planning for old vendor | None unless retention SLA changes for user-facing artifacts |
Run both vendors in parallel — new vendor's decisions logged but not acted on — for 7–14 days before the 1% gate. The disagreement analysis (when do they decide differently? on what populations?) is the single best risk-mitigation artifact you can produce. Free, except for the vendor unit costs.
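The disagreement analysis can start as a one-function report over logged decision pairs. A sketch, assuming you log one `(old_decision, new_decision)` tuple per shadow-mode verification:

```python
from collections import Counter

def disagreement_report(pairs: list[tuple[str, str]]) -> dict:
    """Compare shadow-mode decision pairs (old_decision, new_decision).

    Returns the overall disagreement rate plus a breakdown of which
    transitions occur -- (approve -> decline) on the new vendor is the
    one that costs you users; (decline -> approve) is the one that
    costs you fraud exposure.
    """
    total = len(pairs)
    transitions = Counter((old, new) for old, new in pairs if old != new)
    return {
        "disagreement_rate": sum(transitions.values()) / total if total else 0.0,
        "transitions": dict(transitions),
    }
```

Slice the same report by jurisdiction and document type to find the populations where the vendors diverge.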
Geo-staging playbook for multi-jurisdiction launches
If you operate in 5+ countries, don't ramp everywhere simultaneously. Stage by jurisdiction. Order matters.
Ordering principles
- Start where you have low regulatory exposure. A country where you're not yet fully licensed, or where IDV is voluntary rather than mandatory, is safest first.
- Start where the vendor is strongest. Vendor performance varies enormously by country; sequence to early wins.
- Match volume to operations capacity. Don't ramp your largest country on day 1; you'll overwhelm manual review.
- End with your most strategic country. By the time you're rolling out in your top market, you've absorbed all the lessons.
Example sequence (illustrative, US-headquartered fintech expanding to EU + LATAM)
| Week | Countries activated | Why |
|---|---|---|
| 1–2 | Internal / employees globally | Validate integration end-to-end |
| 3–4 | Ireland, Portugal (small EU markets, vendor strong) | Real users, low blast radius |
| 5–6 | + Spain, Netherlands | Bigger EU markets, vendor familiar |
| 7–8 | + Germany (with care; BaFin scrutiny high) | Largest regulated EU market |
| 9–10 | + Mexico, Colombia | Different document mix; tests EM coverage |
| 11–12 | + Brazil | Largest LATAM market |
| 13–14 | + United Kingdom | FCA scrutiny — saved until you've stabilized |
| 15+ | + United States | Largest market; everything must be working |
You can have a healthy global completion rate and a 20-point regression in Mexico, masked by US volume. Always gate per country, not just globally, and build dashboards split by country before launch.
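Per-country gating means running the same completion-rate check once per jurisdiction rather than once globally. A sketch, assuming the 2-point tolerance from the exit criteria:

```python
def countries_regressing(completion_by_country: dict[str, float],
                         baseline_by_country: dict[str, float],
                         max_drop: float = 0.02) -> list[str]:
    """Flag countries whose completion rate fell more than max_drop
    below that country's own baseline, regardless of how the global
    blended number looks."""
    return sorted(
        country
        for country, rate in completion_by_country.items()
        if country in baseline_by_country
        and rate < baseline_by_country[country] - max_drop
    )

# A healthy blended rate can hide this:
# US at 86% on huge volume, MX at 65% vs an 85% baseline -> MX flagged.
```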
Rollback criteria
Rollback is not failure — it's the system working. Pre-commit to rollback triggers so the team isn't deciding under pressure. Anyone on the on-call rotation should be able to invoke rollback unilaterally based on these triggers.
Automatic rollback triggers
- Vendor API error rate > 5% sustained 10 minutes.
- p95 latency > 5× normal sustained 15 minutes.
- Completion rate down > 5 points day-over-day on a stable population.
- Webhook delivery failure rate > 1% sustained 30 minutes.
- Vendor declares P1 incident.
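The automatic triggers can be evaluated as one function over a metrics window. A sketch with illustrative field names, assuming each field has already been aggregated over its trigger's own sustain period (10/15/30 minutes) by your monitoring pipeline:

```python
def auto_rollback_triggers(window: dict) -> list[str]:
    """Return the list of fired automatic rollback triggers (empty = none).

    Field names are illustrative; each value is assumed pre-aggregated
    over the trigger's sustain period. Rates are fractions, latency in
    seconds, completion_drop_dod is the day-over-day drop in points/100.
    """
    fired = []
    if window["api_error_rate_10m"] > 0.05:
        fired.append("api_error_rate")
    if window["p95_latency_15m"] > 5 * window["p95_latency_normal"]:
        fired.append("p95_latency")
    if window["completion_drop_dod"] > 0.05:
        fired.append("completion_rate")
    if window["webhook_failure_30m"] > 0.01:
        fired.append("webhook_delivery")
    if window["vendor_p1_incident"]:
        fired.append("vendor_p1")
    return fired
```

Wiring this to an alert that pages on-call with the fired trigger names keeps the "confirm from dashboards" step inside its 5-minute timebox.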
Human-judgment rollback triggers
- Manual review queue backing up beyond ops capacity.
- Cluster of user reports indicating a regression invisible to metrics.
- Compliance / MLRO observes audit-trail issues.
- Regulator inquiry.
- Unit cost spiking beyond 20% of forecast.
Rollback procedure
- Page on-call + IDV PM + MLRO/CCO via the same channel (don't fragment).
- Confirm trigger from dashboards (5-minute timebox, then act).
- Flip `idv.new_vendor.pct` to 0 in your config service.
- Verify in monitoring that new traffic is on the old vendor within 5 minutes.
- For in-flight verifications: let them complete naturally; do not re-route mid-flow.
- Comms: internal post in #incidents within 15 minutes. External comms only if user-facing impact > 5 minutes.
- Post-incident: RCA within 5 business days. Decide go/no-go on next attempt.
What to keep running when you've rolled back
Keep the new vendor receiving 1% of traffic in shadow mode (decision logged, not acted on) so you can validate the fix when the vendor reports it's resolved. Don't take them out completely — the cost of re-onboarding is too high.