Chapter 05 · Working artifact

Pilot & Rollout

From sandbox bake-off through to 100% production traffic. Sample sizes, success metrics, the 0% → 1% → 10% → 50% → 100% ramp, the geo-staging plan, and the rollback triggers.

Phase 0 — Sandbox bake-off (pre-signature)

This happens before you sign the contract, during the negotiation phase. It's the single highest-yield activity in the whole evaluation. RFP responses are claims; the bake-off is evidence.

What you run

Take a fixed sample set of synthetic verifications — covering each jurisdiction in scope, each document type you expect, each demographic mix, plus deliberate failure cases — and run it through each shortlisted vendor's sandbox. Score the results yourself.

Sample set composition (200–500 cases total)

Bucket | What | Approx %
Happy path | Clean docs, good lighting, expected demographics | 40%
Difficult-but-valid | Older docs, glare, partial occlusion, accents, low-end Android phones | 25%
Demographic edge cases | Skin-tone diversity, age extremes, transgender (legal name vs current presentation), glasses, beards | 10%
Document edge cases | Recently issued docs, soon-to-expire, military ID, refugee travel doc, residence permit | 10%
Synthetic fraud (positive) | Known spoofs: printed photos, screen replay, deepfake (where ethically obtainable) | 10%
Synthetic invalid (negative) | Cropped docs, blurry, wrong doc type, expired | 5%
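
To make the mix concrete, a minimal sketch (names ours) that turns the percentages above into per-bucket case counts for a chosen total:

SAMPLE_MIX = {
    "happy_path": 0.40,
    "difficult_but_valid": 0.25,
    "demographic_edge_cases": 0.10,
    "document_edge_cases": 0.10,
    "synthetic_fraud": 0.10,
    "synthetic_invalid": 0.05,
}

def bucket_counts(total_cases: int) -> dict[str, int]:
    """Split the total sample across buckets; any rounding leftover
    goes to the happy path so the counts sum to total_cases."""
    counts = {name: int(total_cases * w) for name, w in SAMPLE_MIX.items()}
    counts["happy_path"] += total_cases - sum(counts.values())
    return counts

# bucket_counts(300) -> {'happy_path': 120, 'difficult_but_valid': 75, ...}
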
A note on synthetic fraud testing

Don't try to spoof vendors with real fraud documents. Use vendor-provided test fixtures or published academic datasets. The goal is consistency across vendors, not actually committing fraud. Any synthetic data you generate should be clearly labeled, retained securely, and destroyed at the end of the evaluation.

What to record per vendor

  • Decision (approve / decline / review / indeterminate) for every case.
  • Time-to-decision (median, p95).
  • Vendor reason codes (do they map sensibly to your taxonomy?).
  • Confidence score where available.
  • API errors, timeouts, retries.
  • Cost per check (compare against quoted pricing).
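
A minimal sketch of a record schema for those fields, one row per (vendor, case); the field names are ours, not any vendor's API:

import dataclasses
from typing import Optional

@dataclasses.dataclass
class BakeoffResult:
    """One row per (vendor, case) in the bake-off scoring sheet."""
    vendor: str
    case_id: str
    decision: str                # approve / decline / review / indeterminate
    latency_ms: int              # per-case time-to-decision; aggregate to median/p95
    reason_codes: list[str]      # raw vendor codes, mapped to your taxonomy later
    confidence: Optional[float]  # None where the vendor exposes no score
    api_errors: int              # errors + timeouts + retries observed on this case
    cost_usd: float              # actual per-check cost, to compare vs quoted pricing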

Engineer time

Allow one engineer-week per vendor for sandbox setup, integration, and running the sample set. Time-to-first-verification is itself a scored criterion — a vendor whose sandbox eats 3 days is signalling integration cost.

What to do if the vendor refuses a structured bake-off

Sales teams sometimes resist structured bake-offs because they expose weaknesses against competitors. If a vendor refuses, or insists on running the test themselves with their own data, that's a kill-switch trigger (revisit K7 in 01 § Kill switches). The polite framing: "We're not asking for anything more than what we'd do once live; this is our standard pre-signature evaluation."

Phase 1 — Pilot design (post-signature, pre-full-launch)

The pilot is where you stop simulating and start running real production traffic. The goal: validate every assumption from the RFP and sandbox in your live environment, with real users, before you depend on the vendor for revenue.

Pilot structure

Stage | % traffic | Duration | Purpose
P0 — Internal alpha | 0% | 3–5 days | Employees only; verify integration, webhooks, manual-review console
P1 — Closed beta | 0% | 1 week | Allowlist of ~50 friendly users; full flow including support & dispute
P2 — Production pilot | 1% | 2 weeks | Stable hash-routed; measure all metrics
P3 — Production ramp | 10% | 2 weeks | Scale validates throughput, support load, dispute volume
P4 — Production majority | 50% | 2 weeks | Half of traffic; vendor is now load-bearing
P5 — Full production | 100% | Steady state | Cutover; old vendor in fallback mode for 30 days

Each stage must pass its exit criteria (below) before advancing to the next. Plan ~9–10 weeks from go-live to 100%. Don't rush it; the only thing more expensive than a slow rollout is a fast one that breaks.

Sample sizes — how long does each stage need?

Sample size depends on what you're trying to detect. The numbers below assume a two-sided α = 0.05, 80% power, and the typical metric ranges for consumer IDV. Use them as a planning floor.

What you're detecting | Baseline | Detectable diff | Sample size per arm
Completion rate change | 85% | 1 pt (84% vs 85%) | ~25,000
Completion rate change | 85% | 2 pts (83% vs 85%) | ~6,500
False-reject rate change | 2.0% | 0.5 pt | ~50,000
False-reject rate change | 2.0% | 1.0 pt | ~13,000
p95 latency regression | 30s | 5s | ~500 (Wilcoxon)
Manual-review rate change | 5.0% | 1 pt | ~7,500
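
As a sanity check on those figures, a sketch of the standard two-proportion normal approximation; exact planning tools add a continuity correction, which pushes the results somewhat higher:

from scipy.stats import norm

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size for a two-sided two-proportion z-test
    (normal approximation, no continuity correction)."""
    z_a = norm.ppf(1 - alpha / 2)   # 1.96 at alpha = 0.05
    z_b = norm.ppf(power)           # 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_a + z_b) ** 2 * variance / (p1 - p2) ** 2) + 1

# n_per_arm(0.85, 0.83) -> ~5,300; a continuity correction pushes this
# toward the ~6,500 planning figure in the table above.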

How to plan

  • Estimate your daily verification volume.
  • Decide which metrics are deal-breakers (typically: completion rate, false-reject rate).
  • Compute the days at each % level needed to hit stat-sig on each metric (see the sketch after this list).
  • Pick the longer of the sample-size duration and the operational-load duration, i.e. long enough that support/dispute volume stabilizes.
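
A back-of-envelope sketch for step 3 (function and numbers are illustrative): the new-vendor arm only receives its ramp fraction of daily volume, so the days needed scale inversely with the rollout percentage.

import math

def days_to_stat_sig(n_arm: int, daily_volume: int, rollout_pct: float) -> int:
    """Days at a given ramp % until the new-vendor arm collects n_arm cases.
    Assumes every routed user starts a verification."""
    return math.ceil(n_arm / (daily_volume * rollout_pct / 100.0))

# e.g. days_to_stat_sig(6_500, 20_000, 10) -> 4 days: the statistical
# floor; the operational floor below usually dominates.
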
Operational time beats statistical time

If you can hit stat-sig in 3 days but a support-load spike could take 10 days to surface, run 14 days minimum at each gate. Stat-sig is a floor, not a ceiling.

What to measure

Build the dashboards in 04 § Observability before launch. These are the metrics that gate progression at each stage.

Primary metrics (deal-breakers)

Metric | Definition | Pilot threshold (typical)
Completion rate | started → verified, of users who initiated | ≥ baseline – 2 pts
Auto-approval rate | approved without manual review, of verified | ≥ baseline – 3 pts
False-reject rate | rejected users who pass dispute/appeal, of rejected | ≤ baseline + 0.5 pt
Median time-to-decision | full path: started → final decision | ≤ 60s for auto-decisions
p95 time-to-decision | same, p95 | ≤ 2.5 min for auto-decisions
Manual-review queue depth at peak | open reviews waiting > 1 hour | Below your ops capacity

Secondary metrics (watch list)

  • Dispute rate — users contesting a decline.
  • NPS / CSAT on the verification flow (in-product survey).
  • Cost per successful verification (blended unit cost / completion rate).
  • Sanctions hit rate and hit-investigation throughput.
  • Webhook delivery success rate (≥ 99.9% required).
  • API error rate, broken out by vendor 5xx / 4xx / timeout.
  • Geographic distribution of decline rate — surface jurisdiction-specific regressions.

Tertiary metrics (post-launch monitoring)

  • Downstream fraud rate on approved users (lagging indicator; needs 30–90 days).
  • Re-verification trigger rate.
  • Cohort-aged unit economics.
  • Vendor invoice variance vs forecast.

Pilot exit criteria — gates between stages

Each gate is a hard yes/no. If any criterion fails, you stay at that % level until it's resolved (a checklist-in-code sketch follows the list). Don't soften the gates under pressure; that's how bad launches happen.

  • Completion rate ≥ pre-pilot baseline – 2 pts
  • False-reject rate ≤ pre-pilot baseline + 0.5 pt
  • p95 time-to-decision ≤ 2.5 min for auto-decisions
  • Webhook delivery ≥ 99.9% during measurement window
  • Vendor API error rate ≤ 1% on 5-min windows, no sustained spikes
  • Manual review queue did not exceed 4× baseline peak depth
  • Support ticket volume on verification topics within 1.5× baseline
  • Dispute rate within 1.5× baseline
  • No vendor SLA breach during stage
  • No P1 incidents attributable to the integration
  • Compliance / MLRO sign-off on the audit trail review (random sample of 100 decisions)
  • Cost per successful verification within 10% of forecast
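
A gate review can be as mechanical as a checklist in code. A sketch under an assumed metrics dict (keys and thresholds mirror the list above; the function is ours):

def gate_failures(m: dict, baseline: dict) -> list[str]:
    """Return the criteria that fail; advance only on an empty list."""
    checks = {
        "completion_rate": m["completion_rate"] >= baseline["completion_rate"] - 0.02,
        "false_reject_rate": m["false_reject_rate"] <= baseline["false_reject_rate"] + 0.005,
        "p95_decision_min": m["p95_decision_min"] <= 2.5,
        "webhook_delivery": m["webhook_delivery"] >= 0.999,
        "api_error_rate": m["api_error_rate"] <= 0.01,
        "review_queue_peak": m["review_queue_peak"] <= 4 * baseline["review_queue_peak"],
        "support_tickets": m["support_tickets"] <= 1.5 * baseline["support_tickets"],
        "dispute_rate": m["dispute_rate"] <= 1.5 * baseline["dispute_rate"],
        "cost_per_verification": m["cost_per_verification"] <= 1.10 * baseline["cost_forecast"],
    }
    return [name for name, ok in checks.items() if not ok]

The SLA-breach, P1-incident, and compliance sign-off criteria stay human judgments outside any script.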

Phased rollout: 0% → 1% → 10% → 50% → 100%

How to assign users to stages

Use a deterministic hash of user_id mod 10,000, which gives you fine-grained (0.01-point) control over the ramp. Sticky assignment is mandatory — a user who starts on the new vendor stays on it, even if their verification fails and they retry.

import hashlib

def route_vendor(user_id: str, rollout_pct: float) -> str:
    """
    rollout_pct: 0.0–100.0, what % of users should go to the new vendor.
    Deterministic, hence sticky: the same user gets the same answer for the
    same rollout_pct, and a user's bucket never changes as the % ramps up.
    """
    h = int(hashlib.sha256(f"vendor-rollout:{user_id}".encode()).hexdigest()[:8], 16)
    bucket = (h % 10000) / 100.0   # 0.00–99.99, stable per user
    return "vendor_new" if bucket < rollout_pct else "vendor_old"

# Wired to LaunchDarkly / Statsig / your config service:
ROLLOUT_PCT = config.get("idv.new_vendor.pct", default=0.0)

Stage-by-stage actions

Stage | Eng actions | Ops actions | Comms actions
0% (dark launch) | Shadow mode: call both vendors, compare results, never use the new vendor's decision | Manual-review team trained on the new console | Internal only; no user-facing change
1% | Real decisions for 1% of new users | Daily check-in on manual-review queue | None user-facing
10% | No code changes; observe | Twice-weekly check-in; review 100 random decisions | Optional: in-product banner if flow visually changes
50% | Confirm cost forecast holds; tune any rules | Weekly review | Customer-support brief on differences (if any)
100% | Old vendor in fallback / write-only for 30 days | Decommission planning for old vendor | None unless retention SLA changes for user-facing artifacts

The dark-launch window pays for itself

Run both vendors in parallel — new vendor's decisions logged but not acted on — for 7–14 days before the 1% gate. The disagreement analysis (when do they decide differently? on what populations?) is the single best risk-mitigation artifact you can produce. Free, except for the vendor unit costs.
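
The disagreement analysis itself can start as a few lines over the shadow logs; a sketch, assuming a log record with old/new decisions and a country field:

from collections import Counter

def disagreement_report(shadow_log: list[dict]) -> Counter:
    """Tally (country, old_decision, new_decision) where the vendors
    disagree, so population-specific gaps surface."""
    tally = Counter()
    for rec in shadow_log:
        if rec["old_decision"] != rec["new_decision"]:
            tally[(rec["country"], rec["old_decision"], rec["new_decision"])] += 1
    return tally

# A spike in ('MX', 'approve', 'decline') is exactly the jurisdiction-specific
# regression the geo-staging playbook below is designed to contain.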

Geo-staging playbook for multi-jurisdiction launches

If you operate in 5+ countries, don't ramp everywhere simultaneously. Stage by jurisdiction. Order matters.

Ordering principles

  1. Start where you have low regulatory exposure. A country where you're not yet fully licensed, or where IDV is voluntary rather than mandatory, is safest first.
  2. Start where the vendor is strongest. Vendor performance varies enormously by country; sequence the rollout for early wins.
  3. Match volume to operations capacity. Don't ramp your largest country on day 1; you'll overwhelm manual review.
  4. End with your most strategic country. By the time you're rolling out in your top market, you've absorbed all the lessons.

Example sequence (illustrative, US-headquartered fintech expanding to EU + LATAM)

Week | Countries activated | Why
1–2 | Internal / employees globally | Validate integration end-to-end
3–4 | Ireland, Portugal (small EU markets, vendor strong) | Real users, low blast radius
5–6 | + Spain, Netherlands | Bigger EU markets, vendor familiar
7–8 | + Germany (with care; BaFin scrutiny high) | Largest regulated EU market
9–10 | + Mexico, Colombia | Different document mix; tests EM coverage
11–12 | + Brazil | Largest LATAM market
13–14 | + United Kingdom | FCA scrutiny — saved until you've stabilized
15+ | + United States | Largest market; everything must be working

Per-country metrics matter more than aggregate

You can have a healthy global completion rate and a 20-point regression in Mexico, masked by US volume. Always gate per country, not just globally, and build dashboards split by country before launch.

Rollback criteria

Rollback is not failure — it's the system working. Pre-commit to rollback triggers so the team isn't deciding under pressure. Anyone on the on-call rotation should be able to invoke rollback unilaterally based on these triggers.

Automatic rollback triggers

  • Vendor API error rate > 5% sustained 10 minutes.
  • p95 latency > 5× normal sustained 15 minutes.
  • Completion rate down > 5 points day-over-day on a stable population.
  • Webhook delivery failure rate > 1% sustained 30 minutes.
  • Vendor declares P1 incident.
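
Wiring these triggers to the kill switch can be a small polling job. A sketch only: the metrics and config interfaces below are assumptions, not a real library.

def check_auto_rollback(metrics, config) -> bool:
    """Evaluate the automatic triggers; flip the flag to 0% on any breach.
    `metrics` and `config` stand in for your observability and flag clients."""
    breaches = [
        metrics.rate("vendor_api_error", window_min=10) > 0.05,
        metrics.p95("decision_latency", window_min=15) > 5 * metrics.baseline_p95("decision_latency"),
        metrics.day_over_day_drop("completion_rate") > 0.05,
        metrics.rate("webhook_failure", window_min=30) > 0.01,
    ]
    if any(breaches):
        config.set("idv.new_vendor.pct", 0.0)  # step 3 of the procedure below
        return True
    return False

The fifth trigger (a vendor-declared P1) arrives out-of-band and stays a human page.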

Human-judgment rollback triggers

  • Manual review queue backing up beyond ops capacity.
  • Cluster of user reports indicating a regression invisible to metrics.
  • Compliance / MLRO observes audit-trail issues.
  • Regulator inquiry.
  • Unit cost running more than 20% above forecast.

Rollback procedure

  1. Page on-call + IDV PM + MLRO/CCO via the same channel (don't fragment).
  2. Confirm trigger from dashboards (5-minute timebox, then act).
  3. Flip idv.new_vendor.pct to 0 in your config service.
  4. Verify in monitoring that new traffic is on old vendor within 5 minutes.
  5. For in-flight verifications: let them complete naturally; do not re-route mid-flow.
  6. Comms: internal post in #incidents within 15 minutes. External comms only if user-facing impact > 5 minutes.
  7. Post-incident: RCA within 5 business days. Decide go/no-go on next attempt.
What to keep running when you've rolled back

Keep the new vendor receiving 1% of traffic in shadow mode (decision logged, not acted on) so you can validate the fix when the vendor reports it's resolved. Don't take them out completely — the cost of re-onboarding is too high.