Pilot & Rollout
From sandbox bake-off through to 100% production traffic. Sample sizes, success metrics, the 0% → 1% → 10% → 50% → 100% ramp, the geo-staging plan, and the rollback triggers.
Phase 0 — Sandbox bake-off (pre-signature)
This happens before you sign the contract, during the negotiation phase. It's the single highest-yield activity in the whole evaluation. RFP responses are claims; the bake-off is evidence.
What you run
Take a fixed sample set of synthetic verifications — covering each jurisdiction in scope, each document type you expect, each demographic mix, plus deliberate failure cases — and run it through each shortlisted vendor's sandbox. Score the results yourself.
Sample set composition (200–500 cases total)
| Bucket | What | Approx % |
|---|---|---|
| Happy path | Clean docs, good lighting, expected demographics | 40% |
| Difficult-but-valid | Older docs, glare, partial occlusion, accents, low-end Android phones | 25% |
| Demographic edge cases | Skin-tone diversity, age extremes, transgender (legal name vs current presentation), glasses, beards | 10% |
| Document edge cases | Recently issued docs, soon-to-expire, military ID, refugee travel doc, residence permit | 10% |
| Synthetic fraud (positive) | Known spoofs: printed photos, screen replay, deepfake (where ethically obtainable) | 10% |
| Synthetic invalid (negative) | Cropped docs, blurry, wrong-doc-type, expired | 5% |
Don't try to spoof vendors with real fraud documents. Use vendor-provided test fixtures or publicly published academic datasets. The goal is consistency across vendors, not actually committing fraud. Any synthetic data you generate should be clearly labeled, retained securely, and destroyed at end of evaluation.
What to record per vendor
- Decision (approve / decline / review / indeterminate) for every case.
- Time-to-decision (median, p95).
- Vendor reason codes (do they map sensibly to your taxonomy?).
- Confidence score where available.
- API errors, timeouts, retries.
- Cost per check (compare against quoted pricing).
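The per-vendor recording above collapses into one summary table per vendor. A minimal sketch, assuming each harness result is a dict with hypothetical `decision` / `latency_ms` / `error` / `cost` fields — adapt the names to whatever your test harness actually records:

```python
import math
from statistics import median

def summarize_vendor(results: list[dict]) -> dict:
    """Summarize one vendor's bake-off run.

    Each result is assumed to look like:
      {"decision": "approve", "latency_ms": 1200, "error": False, "cost": 0.42}
    (hypothetical field names -- match them to your harness).
    """
    latencies = sorted(r["latency_ms"] for r in results if not r["error"])
    # Nearest-rank p95 over successful calls.
    p95_idx = max(0, math.ceil(0.95 * len(latencies)) - 1)
    decisions: dict[str, int] = {}
    for r in results:
        decisions[r["decision"]] = decisions.get(r["decision"], 0) + 1
    return {
        "n": len(results),
        "decision_counts": decisions,
        "median_latency_ms": median(latencies),
        "p95_latency_ms": latencies[p95_idx],
        "error_rate": sum(r["error"] for r in results) / len(results),
        "avg_cost": sum(r["cost"] for r in results) / len(results),
    }
```

One summary dict per vendor makes the side-by-side comparison (and the reason-code mapping review) a diff rather than a spreadsheet exercise.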
Engineer time
Allow one engineer-week per vendor for sandbox setup, integration, and running the sample set. Time-to-first-verification is itself a scored criterion — a vendor whose sandbox eats 3 days is signalling integration cost.
What to do if the vendor refuses a structured bake-off
Sales teams sometimes resist structured bake-offs because they expose weaknesses against competitors. If a vendor refuses or insists on running the test themselves with their data: that's a kill-switch trigger (revisit K7 in 01 § Kill switches). The polite framing: "We're not asking for anything more than what we'd do once live; this is our standard pre-signature evaluation."
Phase 1 — Pilot design (post-signature, pre-full-launch)
The pilot is where you stop simulating and start running real production traffic. The goal: validate every assumption from the RFP and sandbox in your live environment, with real users, before you depend on the vendor for revenue.
Pilot structure
| Stage | % traffic | Duration | Purpose |
|---|---|---|---|
| P0 — Internal alpha | 0% | 3–5 days | Employees only; verify integration, webhooks, manual-review console |
| P1 — Closed beta | 0% | 1 week | Allowlist of ~50 friendly users; full flow including support & dispute |
| P2 — Production pilot | 1% | 2 weeks | Stable hash-routed; measure all metrics |
| P3 — Production ramp | 10% | 2 weeks | Scale validates throughput, support load, dispute volume |
| P4 — Production majority | 50% | 2 weeks | Half of traffic; vendor is now load-bearing |
| P5 — Full production | 100% | Steady state | Cutover; old vendor in fallback mode for 30 days |
Each gate must pass exit criteria (below) before advancing. Plan ~9–10 weeks from go-live to 100%. Don't rush it; the only thing more expensive than a slow rollout is a fast one that breaks.
Sample sizes — how long does each stage need?
Sample size depends on what you're trying to detect. The numbers below assume a two-sided α = 0.05, 80% power, and the typical metric ranges for consumer IDV. Use them as a planning floor.
| What you're detecting | Baseline | Detectable diff | Sample size per arm |
|---|---|---|---|
| Completion rate change | 85% | 1 pt (84% vs 85%) | ~25,000 |
| Completion rate change | 85% | 2 pt (83% vs 85%) | ~6,500 |
| False-reject rate change | 2.0% | 0.5 pt | ~50,000 |
| False-reject rate change | 2.0% | 1.0 pt | ~13,000 |
| p95 latency regression | 30s | 5s | ~500 (Wilcoxon) |
| Manual-review rate change | 5.0% | 1 pt | ~7,500 |
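The proportion rows in the table come from the standard two-proportion sample-size calculation; a sketch you can rerun against your own baselines, assuming the normal approximation with z-values hard-coded for the common α/power choices to avoid a scipy dependency. The table's figures sit above the raw formula output, consistent with treating them as planning floors:

```python
import math

def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm sample size to distinguish proportions p1 vs p2.

    Normal approximation; two-sided alpha. z-values are hard-coded
    for the usual cases rather than pulling in scipy.
    """
    z_alpha = {0.05: 1.96, 0.01: 2.576}[alpha]
    z_beta = {0.80: 0.8416, 0.90: 1.2816}[power]
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2)

# e.g. completion rate 85% vs 84% (1-pt diff) -> ~20.6k per arm raw;
# the table rounds up to ~25k as headroom.
```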
How to plan
- Estimate your daily verification volume.
- Decide which metrics are deal-breakers (typically: completion rate, false-reject rate).
- Compute the days at each % level needed to hit stat-sig on each metric.
- Pick the longer of: (sample-size duration) and (operational-load duration — i.e., long enough that support/dispute volume stabilizes).
If you can hit stat-sig in 3 days but a support-load spike could take 10 days to surface, run 14 days minimum at each gate. Stat-sig is a floor, not a ceiling.
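The planning steps above reduce to one calculation per metric and stage. A sketch, assuming a hypothetical 14-day operational floor as the default:

```python
import math

def days_at_stage(daily_volume: int, rollout_pct: float, n_needed: int,
                  min_operational_days: int = 14) -> int:
    """Days a stage must run: the later of stat-sig and operational stabilization.

    daily_volume: total verifications/day across all vendors.
    rollout_pct: 0-100, share of traffic on the new vendor at this stage.
    n_needed: per-arm sample size for your deal-breaker metric.
    """
    daily_in_arm = daily_volume * rollout_pct / 100.0
    stat_sig_days = math.ceil(n_needed / daily_in_arm)
    return max(stat_sig_days, min_operational_days)

# e.g. 5,000 verifications/day at the 10% stage -> 500/day on the new vendor.
# Detecting a 2-pt completion drop (~6,500 per arm) needs 13 days of data,
# so the 14-day operational floor dominates.
```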
What to measure
Build the dashboards in 04 § Observability before launch. These are the metrics that gate progression at each stage.
Primary metrics (deal-breakers)
| Metric | Definition | Pilot threshold (typical) |
|---|---|---|
| Completion rate | started → verified, of users who initiated | ≥ baseline – 2 pts |
| Auto-approval rate | approved without manual review, of verified | ≥ baseline – 3 pts |
| False-reject rate | rejected users who pass dispute/appeal, of rejected | ≤ baseline + 0.5 pt |
| Median time-to-decision | full path: started → final decision | ≤ 60s for auto-decisions |
| p95 time-to-decision | same, p95 | ≤ 2.5 min for auto-decisions |
| Manual-review queue depth at peak | open reviews waiting > 1 hour | Below your ops capacity |
Secondary metrics (watch list)
- Dispute rate — users contesting a decline.
- NPS / CSAT on the verification flow (in-product survey).
- Cost per successful verification (blended unit cost / completion rate).
- Sanctions hit rate and hit-investigation throughput.
- Webhook delivery success rate (≥ 99.9% required).
- API error rate by vendor 5xx / 4xx / timeout.
- Geographic distribution of decline rate — surface jurisdiction-specific regressions.
Tertiary metrics (post-launch monitoring)
- Downstream fraud rate on approved users (lagging indicator; needs 30–90 days).
- Re-verification trigger rate.
- Cohort-aged unit economics.
- Vendor invoice variance vs forecast.
Pilot exit criteria — gates between stages
Each gate is a hard yes/no. Fail any criterion at a gate and you stay at that % level until it's resolved. Don't soften the gates under pressure; that's how bad launches happen.
- Completion rate ≥ pre-pilot baseline – 2 pts
- False-reject rate ≤ pre-pilot baseline + 0.5 pt
- p95 time-to-decision ≤ 2.5 min for auto-decisions
- Webhook delivery ≥ 99.9% during measurement window
- Vendor API error rate ≤ 1% on 5-min windows, no sustained spikes
- Manual review queue did not exceed 4× baseline peak depth
- Support ticket volume on verification topics within 1.5× baseline
- Dispute rate within 1.5× baseline
- No vendor SLA breach during stage
- No P1 incidents attributable to the integration
- Compliance / MLRO sign-off on the audit trail review (random sample of 100 decisions)
- Cost per successful verification within 10% of forecast
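The numeric criteria above can be checked mechanically; the sign-off items (SLA breaches, P1 incidents, MLRO review) stay human checkboxes. A sketch with illustrative metric field names — wire these to whatever your dashboards actually export:

```python
def gate_passes(metrics: dict, baseline: dict) -> tuple[bool, list[str]]:
    """Evaluate the numeric gate criteria; returns (pass, failed criteria).

    Field names are illustrative. Rates are fractions (0.85, not 85),
    p95_decision_s is seconds, queue/ticket figures are counts.
    """
    checks = [
        ("completion_rate",
         metrics["completion_rate"] >= baseline["completion_rate"] - 0.02),
        ("false_reject_rate",
         metrics["false_reject_rate"] <= baseline["false_reject_rate"] + 0.005),
        ("p95_decision_s", metrics["p95_decision_s"] <= 150),   # 2.5 min
        ("webhook_delivery", metrics["webhook_delivery"] >= 0.999),
        ("api_error_rate", metrics["api_error_rate"] <= 0.01),
        ("review_queue_peak",
         metrics["review_queue_peak"] <= 4 * baseline["review_queue_peak"]),
        ("support_tickets",
         metrics["support_tickets"] <= 1.5 * baseline["support_tickets"]),
        ("dispute_rate",
         metrics["dispute_rate"] <= 1.5 * baseline["dispute_rate"]),
        ("cost_per_verification",
         metrics["cost_per_verification"] <= 1.10 * baseline["cost_per_verification"]),
    ]
    failures = [name for name, ok in checks if not ok]
    return (not failures), failures
```

Encoding the gate in code removes the "it's only slightly below threshold" conversation: the function either returns an empty failure list or it doesn't.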
Phased rollout: 0% → 1% → 10% → 50% → 100%
How to assign users to stages
Use a deterministic hash of user_id mod 10,000 — gives you fine-grained control. Sticky assignment is mandatory — a user who starts on the new vendor stays on it, even if their verification fails and they retry.
```python
import hashlib

def route_vendor(user_id: str, rollout_pct: float) -> str:
    """
    rollout_pct: 0.0-100.0, what % of users should go to the new vendor.
    Deterministic: the same user always gets the same answer for the same
    rollout_pct -- which is what makes assignment sticky.
    """
    h = int(hashlib.sha256(f"vendor-rollout:{user_id}".encode()).hexdigest()[:8], 16)
    bucket = (h % 10000) / 100.0  # 0.00-99.99
    return "vendor_new" if bucket < rollout_pct else "vendor_old"

# Wired to LaunchDarkly / Statsig / your config service:
ROLLOUT_PCT = config.get("idv.new_vendor.pct", default=0.0)
```

Stage-by-stage actions
| Stage | Eng actions | Ops actions | Comms actions |
|---|---|---|---|
| 0% (dark launch) | Shadow-mode: call both vendors, compare results, never use new vendor's decision | Manual review team trained on the new console | Internal-only; no user-facing change |
| 1% | Real decisions for 1% of new users | Daily check-in on manual review queue | None user-facing |
| 10% | No code changes; observe | Twice-weekly check-in; review 100 random decisions | Optional: in-product banner if flow visually changes |
| 50% | Confirm cost forecast holds; tune any rules | Weekly review | Customer support brief on differences (if any) |
| 100% | Old vendor in fallback / write-only for 30 days | Decommission planning for old vendor | None unless retention SLA changes for user-facing artifacts |
Run both vendors in parallel — new vendor's decisions logged but not acted on — for 7–14 days before the 1% gate. The disagreement analysis (when do they decide differently? on what populations?) is the single best risk-mitigation artifact you can produce. Free, except for the vendor unit costs.
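The disagreement analysis can start as a one-function report over logged decision pairs. A sketch, assuming you log one `(old_decision, new_decision)` tuple per shadow-mode verification:

```python
from collections import Counter

def disagreement_report(pairs: list[tuple[str, str]]) -> dict:
    """Compare shadow-mode decision pairs (old_decision, new_decision).

    Returns the overall disagreement rate plus a breakdown of which
    transitions occur -- (approve -> decline) on the new vendor is the
    one that costs you users; (decline -> approve) is the one that
    costs you fraud exposure.
    """
    total = len(pairs)
    transitions = Counter((old, new) for old, new in pairs if old != new)
    return {
        "disagreement_rate": sum(transitions.values()) / total if total else 0.0,
        "transitions": dict(transitions),
    }
```

Slice the same report by jurisdiction and document type to find the populations where the vendors diverge.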
Geo-staging playbook for multi-jurisdiction launches
If you operate in 5+ countries, don't ramp everywhere simultaneously. Stage by jurisdiction. Order matters.
Ordering principles
- Start where you have low regulatory exposure. A country where you're not yet fully licensed, or where IDV is voluntary rather than mandatory, is safest first.
- Start where the vendor is strongest. Vendor performance varies enormously by country; sequence to early wins.
- Match volume to operations capacity. Don't ramp your largest country on day 1; you'll overwhelm manual review.
- End with your most strategic country. By the time you're rolling out in your top market, you've absorbed all the lessons.
Example sequence (illustrative, US-headquartered fintech expanding to EU + LATAM)
| Week | Countries activated | Why |
|---|---|---|
| 1–2 | Internal / employees globally | Validate integration end-to-end |
| 3–4 | Ireland, Portugal (small EU markets, vendor strong) | Real users, low blast radius |
| 5–6 | + Spain, Netherlands | Bigger EU markets, vendor familiar |
| 7–8 | + Germany (with care; BaFin scrutiny high) | Largest regulated EU market |
| 9–10 | + Mexico, Colombia | Different document mix; tests EM coverage |
| 11–12 | + Brazil | Largest LATAM market |
| 13–14 | + United Kingdom | FCA scrutiny — saved until you've stabilized |
| 15+ | + United States | Largest market; everything must be working |
You can have a healthy global completion rate and a 20-point regression in Mexico, masked by US volume. Always gate per country, not just globally, and build dashboards split by country before launch.
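Per-country gating means running the same completion-rate check once per jurisdiction rather than once globally. A sketch, assuming the 2-point tolerance from the exit criteria:

```python
def countries_regressing(completion_by_country: dict[str, float],
                         baseline_by_country: dict[str, float],
                         max_drop: float = 0.02) -> list[str]:
    """Flag countries whose completion rate fell more than max_drop
    below that country's own baseline, regardless of how the global
    blended number looks."""
    return sorted(
        country
        for country, rate in completion_by_country.items()
        if country in baseline_by_country
        and rate < baseline_by_country[country] - max_drop
    )

# A healthy blended rate can hide this:
# US at 86% on huge volume, MX at 65% vs an 85% baseline -> MX flagged.
```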
Rollback criteria
Rollback is not failure — it's the system working. Pre-commit to rollback triggers so the team isn't deciding under pressure. Anyone on the on-call rotation should be able to invoke rollback unilaterally based on these triggers.
Automatic rollback triggers
- Vendor API error rate > 5% sustained 10 minutes.
- p95 latency > 5× normal sustained 15 minutes.
- Completion rate down > 5 points day-over-day on a stable population.
- Webhook delivery failure rate > 1% sustained 30 minutes.
- Vendor declares P1 incident.
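The automatic triggers can be evaluated as one function over a metrics window. A sketch with illustrative field names, assuming each field has already been aggregated over its trigger's own sustain period (10/15/30 minutes) by your monitoring pipeline:

```python
def auto_rollback_triggers(window: dict) -> list[str]:
    """Return the list of fired automatic rollback triggers (empty = none).

    Field names are illustrative; each value is assumed pre-aggregated
    over the trigger's sustain period. Rates are fractions, latency in
    seconds, completion_drop_dod is the day-over-day drop in points/100.
    """
    fired = []
    if window["api_error_rate_10m"] > 0.05:
        fired.append("api_error_rate")
    if window["p95_latency_15m"] > 5 * window["p95_latency_normal"]:
        fired.append("p95_latency")
    if window["completion_drop_dod"] > 0.05:
        fired.append("completion_rate")
    if window["webhook_failure_30m"] > 0.01:
        fired.append("webhook_delivery")
    if window["vendor_p1_incident"]:
        fired.append("vendor_p1")
    return fired
```

Wiring this to an alert that pages on-call with the fired trigger names keeps the "confirm from dashboards" step inside its 5-minute timebox.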
Human-judgment rollback triggers
- Manual review queue backing up beyond ops capacity.
- Cluster of user reports indicating a regression invisible to metrics.
- Compliance / MLRO observes audit-trail issues.
- Regulator inquiry.
- Unit cost spiking beyond 20% of forecast.
Rollback procedure
- Page on-call + IDV PM + MLRO/CCO via the same channel (don't fragment).
- Confirm trigger from dashboards (5-minute timebox, then act).
- Flip `idv.new_vendor.pct` to 0 in your config service.
- Verify in monitoring that new traffic is on the old vendor within 5 minutes.
- For in-flight verifications: let them complete naturally; do not re-route mid-flow.
- Comms: internal post in #incidents within 15 minutes. External comms only if user-facing impact > 5 minutes.
- Post-incident: RCA within 5 business days. Decide go/no-go on next attempt.
What to keep running when you've rolled back
Keep the new vendor receiving 1% of traffic in shadow mode (decision logged, not acted on) so you can validate the fix when the vendor reports it's resolved. Don't take them out completely — the cost of re-onboarding is too high.