Chapter 01 · Working artifact

The Scorecard

38 weighted criteria across seven categories (capability, performance, cost, compliance, integration, operations, commercial), plus kill switches that auto-disqualify, a worked example across four vendors, and the math that turns it all into a single score.

Scope first — before the scorecard

The scorecard only makes sense once you know what you're scoring against. Pin these answers before writing weights. Disagreement between Product, Compliance, and Finance on any of these is the most common reason scorecards collapse mid-evaluation.

| Decision | Why it matters | Example answer |
|---|---|---|
| What are you verifying? | Consumer IDV, KYB, age, sanctions only, or all of the above | Consumer IDV + sanctions/PEP screening + KYB-lite for partner businesses |
| Under which regulatory regimes? | Different regimes require different evidence and retention | NYDFS BitLicense, FCA, MAS, BaFin, MiCA |
| Which countries by Day 365? | Coverage breadth is a vendor-level constraint | US + EU-27 + UK + SG + AU; not LATAM until 2027 |
| Tiered KYC or single tier? | Determines re-verification economics | Two-tier: light at signup, full at $500 cumulative |
| Expected annual volume? | Tier pricing band; vendor enterprise readiness | 1.2M IDV checks/year, 80K KYB checks/year |
| Acceptable false-reject ceiling? | The business constraint, not a vendor metric | ≤2% false reject on a population matched to ours |
| Build-vs-buy on decision logic? | Determines if you need the vendor's workflow engine | Buy decisioning; we won't build a rules engine year-1 |

Kill switches — anything that auto-disqualifies

A kill switch is a hard floor. Vendors that fail any of these are eliminated before scoring. Be ruthless here; it saves weeks downstream. Customize for your situation — every kill switch should have an owner who'll defend it.

| # | Kill switch | Why | Owner |
|---|---|---|---|
| K1 | No SOC 2 Type II report current within 12 months | Required by every meaningful regulated counterparty; non-negotiable | Security |
| K2 | No GDPR-compliant DPA they'll sign | Cannot lawfully transfer EU PII | Legal / DPO |
| K3 | No EU data residency option (if you serve EU users) | Schrems II + many regulators require it | Legal |
| K4 | No production sandbox we can hit in < 24h | You cannot run a bake-off without one | Engineering |
| K5 | No published uptime SLA, or SLA < 99.5% | IDV outage = signup outage = revenue outage | Engineering / SRE |
| K6 | No documented evidence of regulator acceptance in your top-3 jurisdictions | Letter from a customer, audit report citation, or examiner reference | Compliance / MLRO |
| K7 | Won't share aggregated performance metrics from a comparable customer | Marketing claims aren't measurements | Product |
| K8 | Minimum commitment > 18 months with no exit clause | You'll regret it | Procurement / Finance |
| K9 | Pre-trained biometric models with no documented bias testing | NIST FRVT or equivalent; legal exposure | Legal / Product |
| K10 | No webhook or async-result API (synchronous-only) | Blocks scalable orchestration; forces poll loops | Engineering |

Use kill switches as a filter, not a tiebreaker

Either a vendor passes all kill switches and goes into the scorecard, or they don't and they're out. Don't be tempted to "score them low on K6" instead of cutting them — that's how vendors who should never have made the shortlist end up with a signed contract.
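
In code, the gate is a set comparison, not a score. A minimal sketch, assuming each vendor record lists the switches it passed (the same shape as the JSON skeleton at the end of this chapter); the vendor names here are hypothetical:

REQUIRED = {f"K{i}" for i in range(1, 11)}  # K1–K10; drop K3 if you have no EU users

def shortlist(vendors):
    """Binary gate: pass every kill switch or never reach the scorecard."""
    return [v for v in vendors if REQUIRED <= set(v["kill_switches_passed"])]

candidates = shortlist([
    {"name": "Vendor A", "kill_switches_passed": set(REQUIRED)},
    {"name": "Vendor E", "kill_switches_passed": REQUIRED - {"K6"}},  # no regulator evidence
])
# candidates contains only Vendor A; Vendor E never gets a score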

Weighting the seven categories

Weights are not universal. They reflect your business stage and regulatory posture. A pre-launch fintech weights compliance and capability higher; a scale-stage business with an existing book weights cost and operations higher. Here are three reference weightings as starting points.

| Category | Pre-launch fintech | Scale-stage fintech | Regulated bank / EMI |
|---|---|---|---|
| Capability — coverage, biometrics, KYB | 22% | 15% | 20% |
| Performance — false-reject, completion, latency | 18% | 22% | 15% |
| Cost — per-check, AML hits, minimums, TCO | 10% | 18% | 10% |
| Compliance — certs, residency, regulator acceptance | 20% | 12% | 25% |
| Integration — API quality, SDK platforms, webhooks | 15% | 13% | 10% |
| Operations — SLA, support, dispute mechanism | 8% | 12% | 12% |
| Commercial — flexibility, exit, term | 7% | 8% | 8% |
| Total | 100% | 100% | 100% |

How to set weights without a fight

Run a 60-minute workshop. Each stakeholder (PM, Eng lead, MLRO, CFO delegate, Support lead) gets 100 "tokens" to distribute across the seven categories independently. Average the results, then debate variances of more than 5 points. Write down the rationale in a doc; you'll need it when someone re-litigates in week 8.
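
A minimal sketch of the tally, assuming each stakeholder's 100 tokens arrive as a dict (the allocations below are hypothetical) and interpreting "variances of more than 5 points" as anyone sitting more than 5 points from the room's average on a category:

allocations = {
    "PM":   {"capability": 25, "performance": 20, "cost": 10, "compliance": 15,
             "integration": 15, "operations": 8,  "commercial": 7},
    "Eng":  {"capability": 20, "performance": 20, "cost": 5,  "compliance": 15,
             "integration": 25, "operations": 10, "commercial": 5},
    "MLRO": {"capability": 15, "performance": 10, "cost": 5,  "compliance": 40,
             "integration": 10, "operations": 10, "commercial": 10},
}

categories = list(allocations["PM"])
avg = {c: sum(a[c] for a in allocations.values()) / len(allocations) for c in categories}

for who, a in allocations.items():
    for c in categories:
        if abs(a[c] - avg[c]) > 5:  # the variances worth debating
            print(f"{who} on {c}: {a[c]} vs room average {avg[c]:.0f}")

The printed lines are your workshop agenda; everything else is settled by the average.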

The 38-criterion scorecard

Each criterion is scored 0–5. Within a category, all criteria are weighted equally unless flagged. Mark a criterion (★) to give it 2× weight inside its category.

Scoring rubric (0–5)

| Score | Meaning |
|---|---|
| 5 | Best-in-class. Exceeds requirement with margin; verified independently. |
| 4 | Strong. Meets requirement; minor caveats or vendor-attested only. |
| 3 | Acceptable. Meets minimum. Not differentiating. |
| 2 | Weak. Has gaps. Workaround required. |
| 1 | Poor. Major gap; only viable if other vendors fail too. |
| 0 | Absent. Vendor doesn't offer this at all. |

Capability (7 criteria)

| ID | Criterion | Probe |
|---|---|---|
| C1 ★ | Document coverage by country (top-N markets) | Provide the actual list of document types per country; not "200 countries supported" |
| C2 | Biometric matching accuracy | NIST FRVT 1:1 published score; vendor's published FAR/FRR |
| C3 ★ | Liveness detection (passive + active) | iBeta Level-2 PAD certification; spoofing test results |
| C4 | Sanctions / PEP / adverse-media integration | Native or via partner? Which lists (OFAC, EU, UK, UN, HMT)? Refresh cadence |
| C5 | KYB support (entity formation, UBO, control person) | UBO data quality and corporate-registry depth by jurisdiction |
| C6 | Re-verification & ongoing monitoring | Continuous sanctions screening; re-KYC triggers; cost model |
| C7 | Workflow / decisioning engine | Can you author rules? Versioning? A/B routing? |

Performance (5 criteria)

| ID | Criterion | Probe |
|---|---|---|
| P1 ★ | False-reject rate on a population like yours | Demand breakdowns by country, age cohort, and doc type. ≤2% is a strong number for consumer IDV |
| P2 | False-accept (fraud pass-through) rate | Hardest to measure; ask for a case study with downstream fraud outcome |
| P3 ★ | Completion rate (start → submit → verified) | 85%+ is good for unguided consumer; 65–75% for KYB |
| P4 | Median & p95 time-to-decision | Auto-decisions should be sub-30s; manual queue median < 4h |
| P5 | Mobile-web vs in-app SDK performance gap | The gap matters; vendors hide it |

Cost (5 criteria)

| ID | Criterion | Probe |
|---|---|---|
| $1 ★ | Per-check pricing at your volume tier | Public range: $0.50–$5.00 for consumer IDV; $3–$15 for KYB. Demand a 3-tier curve |
| $2 | AML / PEP / sanctions hit pricing | Is a screening a separate billable event? Per-list? Per-hit investigation? |
| $3 | Re-verification & ongoing-monitoring fees | Often hidden; can be 20–40% of TCO |
| $4 ★ | Minimum commitment / overages / shortfalls | Annual minimum, monthly minimum, overage rate, shortfall penalty |
| $5 | "Other" line items: manual review, dispute, data export | The line-item list itself is the diagnostic |

Compliance (6 criteria)

| ID | Criterion | Probe |
|---|---|---|
| CO1 | SOC 2 Type II | Current report dated within 12 months; review the actual report, not the cert page |
| CO2 | ISO 27001 (and 27701 for privacy) | Certificate copy + scope statement |
| CO3 ★ | GDPR posture + EU data residency | EU-region processing? Sub-processor list? SCCs / IDTA in DPA? |
| CO4 ★ | Regulator acceptance in your top-3 jurisdictions | Named regulated customers; examiner-letter precedent; case-law analogues |
| CO5 | Bias / fairness testing (biometric) | Published test results by demographic; NIST FRVT demographic breakdowns |
| CO6 | Data retention configurability | Per-jurisdiction retention; right-to-erasure mechanism; audit log retention |

Integration (6 criteria)

| ID | Criterion | Probe |
|---|---|---|
| I1 ★ | API quality (REST/gRPC, idempotency, errors) | Read the docs. Look for idempotency keys, retry semantics, error taxonomy |
| I2 | SDK coverage (iOS, Android, Web, RN, Flutter) | Native quality varies; RN/Flutter often regressed |
| I3 ★ | Webhook delivery, signing, replay | HMAC, replay window, dead-letter queue, idempotency |
| I4 | Sandbox quality & test fixtures | Synthetic doc bank? Forced-failure modes? Recorded scenarios? |
| I5 | Time-to-first-verification in sandbox | ≤ 4 engineer-hours is excellent; ≥ 2 days suggests a hostile DX |
| I6 | White-label / branding flexibility | Custom CSS? Translatable strings? Logo placement in liveness flow? |

Operations (5 criteria)

| ID | Criterion | Probe |
|---|---|---|
| O1 ★ | Uptime SLA & credit mechanism | 99.9% on the synchronous API minimum; credits as a % of monthly fees |
| O2 | Support tiers & response times | P1 / P2 / P3 definitions; named contact; Slack / shared channel |
| O3 | Dispute / appeal mechanism for users | Who handles a user who insists they're not a fraudster? SLA? |
| O4 | Incident communication & status page | RCA in < 5 business days; status page with subscription |
| O5 | Customer success / TAM | Named CSM? Quarterly business review? Roadmap influence? |

Commercial (4 criteria)

| ID | Criterion | Probe |
|---|---|---|
| M1 ★ | Contract term flexibility | Month-to-month available? 1-year vs 3-year delta? Auto-renewal terms? |
| M2 | Exit / termination clauses | Termination-for-convenience window; data export at termination; transition assistance |
| M3 | MSA red-flags | Liability cap; indemnification scope; IP assignment on improvements; audit rights |
| M4 | Price-change protection | CPI cap on annual increases; price-renegotiation triggers (e.g., 2× volume, 50% volume drop) |

Worked example — four vendors, pre-launch fintech weights

This is illustrative. Real scores depend on your testing, your jurisdictions, your population. Use this to see how the math behaves, not to learn the answers — and never use a generic scorecard score as a substitute for a sandbox bake-off.

Vendor names are illustrative

The numbers below are reasonable archetypes ("the configurable workflow vendor", "the enterprise incumbent", "the EU-regulated specialist", "the emerging-markets one-stop"), not statements about specific vendors' actual current performance. Run your own bake-off.

| Category (weight) | Vendor A (configurable workflow) | Vendor B (enterprise incumbent) | Vendor C (EU-regulated specialist) | Vendor D (EM one-stop) |
|---|---|---|---|---|
| Capability (22%) | 4.1 | 4.5 | 3.6 | 4.4 |
| Performance (18%) | 4.0 | 3.4 | 3.8 | 3.5 |
| Cost (10%) | 3.5 | 2.6 | 3.7 | 4.2 |
| Compliance (20%) | 3.8 | 4.4 | 4.6 | 3.7 |
| Integration (15%) | 4.6 | 3.2 | 3.5 | 3.9 |
| Operations (8%) | 3.9 | 4.2 | 3.8 | 3.6 |
| Commercial (7%) | 4.0 | 2.9 | 3.3 | 3.8 |
| Weighted total | 4.01 | 3.76 | 3.83 | 3.90 |

Vendor A wins this scorecard narrowly, but the spread (3.76–4.01) is tight enough that a single re-scored row could flip it. That's normal, and it's a feature: it means you've reached the point where qualitative judgment, reference calls, and the bake-off matter more than the scorecard.
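
To see how fragile a narrow win is, re-score one row and recompute. A minimal sketch using the pre-launch weights and the category scores from the table; the re-score itself is hypothetical (say deeper document-coverage testing knocks Vendor A's Capability down):

WEIGHTS = {"capability": 0.22, "performance": 0.18, "cost": 0.10, "compliance": 0.20,
           "integration": 0.15, "operations": 0.08, "commercial": 0.07}

def total(scores):
    return sum(scores[c] * w for c, w in WEIGHTS.items())

A = {"capability": 4.1, "performance": 4.0, "cost": 3.5, "compliance": 3.8,
     "integration": 4.6, "operations": 3.9, "commercial": 4.0}
D = {"capability": 4.4, "performance": 3.5, "cost": 4.2, "compliance": 3.7,
     "integration": 3.9, "operations": 3.6, "commercial": 3.8}

total(A), total(D)               # ≈ 4.01 vs 3.90: Vendor A wins
total({**A, "capability": 3.5})  # ≈ 3.88: one re-scored row, and Vendor D wins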

What the scorecard tells you when it's close

  • Which dimensions are differentiated. Cost (4.2 vs 2.6) and Integration (4.6 vs 3.2) are the dimensions with real spread. Those should drive the negotiation; the sketch after this list computes the spread directly.
  • Where every vendor is mediocre. If everyone scores ~3.5 on Performance, none of them have measured what you need. That's a signal to demand the bake-off be on your data, not theirs.
  • Where the bar is high enough. If everyone is ≥ 4.0 on Compliance, you can stop investing scoring cycles there and reallocate.
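
Spread is cheap to compute. A minimal sketch over the worked-example columns (vendors A–D, in order):

category_scores = {
    "capability":  [4.1, 4.5, 3.6, 4.4],
    "performance": [4.0, 3.4, 3.8, 3.5],
    "cost":        [3.5, 2.6, 3.7, 4.2],
    "compliance":  [3.8, 4.4, 4.6, 3.7],
    "integration": [4.6, 3.2, 3.5, 3.9],
    "operations":  [3.9, 4.2, 3.8, 3.6],
    "commercial":  [4.0, 2.9, 3.3, 3.8],
}
spread = {c: round(max(v) - min(v), 1) for c, v in category_scores.items()}
for c, s in sorted(spread.items(), key=lambda kv: -kv[1]):
    print(c, s)  # cost 1.6, integration 1.4, commercial 1.1, ... operations 0.6

Negotiate where the spread is large; stop spending scoring cycles where it is small.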

Show the math

The scorecard is a weighted mean. Each category score is the arithmetic mean of its 0–5 criterion scores, with starred criteria counting twice. The final score is the weighted mean of the category scores. The result is stable against small re-scorings: a one-point change on any single criterion moves the final score by at most about 0.05 (worst case, a starred criterion in the 22% category: 0.22 × 2⁄9 ≈ 0.049), so typical inter-rater disagreement rarely reorders vendors.

Per-category score

def category_score(criteria):
    """
    criteria: list of (score_0_to_5, weight_1_or_2)
    """
    total_weight = sum(w for _, w in criteria)
    total_score  = sum(s * w for s, w in criteria)
    return total_score / total_weight

# Capability example for Vendor A
capability_A = category_score([
    (4, 2),  # C1 doc coverage (starred = weight 2)
    (4, 1),  # C2 biometric
    (5, 2),  # C3 liveness (starred)
    (4, 1),  # C4 sanctions
    (3, 1),  # C5 KYB
    (4, 1),  # C6 re-verification
    (4, 1),  # C7 workflow
])
# = (8 + 4 + 10 + 4 + 3 + 4 + 4) / 9
# = 37 / 9
# = 4.11

Weighted total

CATEGORY_WEIGHTS = {
    "capability":  0.22,
    "performance": 0.18,
    "cost":        0.10,
    "compliance":  0.20,
    "integration": 0.15,
    "operations":  0.08,
    "commercial":  0.07,
}

def total_score(vendor):
    return sum(
        vendor[cat] * w
        for cat, w in CATEGORY_WEIGHTS.items()
    )

vendor_A = {
    "capability":  4.1, "performance": 4.0, "cost":        3.5,
    "compliance":  3.8, "integration": 4.6, "operations":  3.9,
    "commercial":  4.0,
}
# total_score(vendor_A) ≈ 4.01  (4.014 exactly)
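
The same function reproduces the whole worked-example row of totals. Reusing total_score and vendor_A from above, with the other category scores copied from the table:

vendors = {
    "Vendor A": vendor_A,
    "Vendor B": {"capability": 4.5, "performance": 3.4, "cost": 2.6, "compliance": 4.4,
                 "integration": 3.2, "operations": 4.2, "commercial": 2.9},
    "Vendor C": {"capability": 3.6, "performance": 3.8, "cost": 3.7, "compliance": 4.6,
                 "integration": 3.5, "operations": 3.8, "commercial": 3.3},
    "Vendor D": {"capability": 4.4, "performance": 3.5, "cost": 4.2, "compliance": 3.7,
                 "integration": 3.9, "operations": 3.6, "commercial": 3.8},
}
for name, scores in sorted(vendors.items(), key=lambda kv: -total_score(kv[1])):
    print(f"{name}: {total_score(scores):.2f}")
# Vendor A: 4.01 / Vendor D: 3.90 / Vendor C: 3.83 / Vendor B: 3.76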

Spreadsheet equivalent

For non-coders, the same math in a Google Sheet:

      A            B        C          D          E          F
1     Category     Weight   Vendor A   Vendor B   Vendor C   Vendor D
2     Capability   0.22     4.1        4.5        3.6        4.4
3     Performance  0.18     4.0        3.4        3.8        3.5
…
9     TOTAL                 =SUMPRODUCT($B$2:$B$8,C2:C8)   (drag right)

Export & templates

Copy the JSON skeleton into your tool of choice (Notion DB, Coda, Airtable, sheet). It mirrors the structure above so vendor responses, criterion scores, and weights all live in one schema.

{
  "scorecard_version": "1.0",
  "weights": {
    "capability": 0.22, "performance": 0.18, "cost": 0.10,
    "compliance": 0.20, "integration": 0.15, "operations": 0.08,
    "commercial": 0.07
  },
  "kill_switches": [
    {"id": "K1", "label": "SOC 2 Type II current", "owner": "security"},
    {"id": "K2", "label": "GDPR DPA signed",       "owner": "legal"}
  ],
  "criteria": [
    {"id": "C1", "category": "capability", "label": "Doc coverage by country", "weight_multiplier": 2},
    {"id": "C2", "category": "capability", "label": "Biometric matching",      "weight_multiplier": 1},
    {"id": "C3", "category": "capability", "label": "Liveness (passive + active)", "weight_multiplier": 2}
  ],
  "vendors": [
    {
      "name": "Vendor A",
      "kill_switches_passed": ["K1", "K2"],
      "scores": {"C1": 4, "C2": 4, "C3": 5}
    }
  ]
}
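
And a minimal sketch that consumes the schema end to end, assuming the skeleton above is saved as scorecard.json (a hypothetical filename): it gates on kill switches first, rebuilds category scores from per-criterion scores using weight_multiplier, then takes the weighted total. Weights for categories with no scored criteria are simply unused, so the partial skeleton above still runs:

import json
from collections import defaultdict

with open("scorecard.json") as f:        # the skeleton above, saved to disk
    card = json.load(f)

assert abs(sum(card["weights"].values()) - 1.0) < 1e-9  # weights must sum to 1

switches = {k["id"] for k in card["kill_switches"]}
criteria = {c["id"]: c for c in card["criteria"]}

for vendor in card["vendors"]:
    if switches - set(vendor["kill_switches_passed"]):   # filter, not tiebreaker
        print(f'{vendor["name"]}: eliminated')
        continue
    score_sum, weight_sum = defaultdict(float), defaultdict(float)
    for cid, score in vendor["scores"].items():
        cat = criteria[cid]["category"]
        mult = criteria[cid].get("weight_multiplier", 1)
        score_sum[cat] += score * mult
        weight_sum[cat] += mult
    total = sum(card["weights"][c] * score_sum[c] / weight_sum[c] for c in score_sum)
    print(f'{vendor["name"]}: {total:.2f}')
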
Why per-criterion (not per-category) scoring matters

It's tempting to score directly at the category level — "Vendor A is a 4 on Capability." Don't. Two reasons:

  1. You'll forget the rationale. When stakeholders re-litigate the choice in month 3, "Capability: 4" is unfalsifiable. "C1 doc coverage: 4 because they support 184 of our top 200 countries" is.
  2. Vendors will negotiate at the criterion level. If you say "we scored you 3 on Performance," they have nothing actionable. If you say "we scored you 2 on P1 because your false-reject in our LATAM bake-off was 4.1%," they can either fix it, give you a price concession, or you both know to walk.
Should we use a single number or Pareto-rank vendors?

Both. The weighted score is the headline (good for stakeholder communication). The Pareto view (which vendors are undominated, meaning no rival beats them on all 7 categories) is what you actually use to decide; a minimal sketch follows. If Vendor A is best on 4 categories and Vendor B is best on 3, you have a real choice. If Vendor A is best on 6 of 7, the decision is made.
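
A dominance check over the worked-example category scores, where a vendor dominates another if it is at least as good on every category and strictly better on at least one:

vendors = {
    "A": {"capability": 4.1, "performance": 4.0, "cost": 3.5, "compliance": 3.8,
          "integration": 4.6, "operations": 3.9, "commercial": 4.0},
    "B": {"capability": 4.5, "performance": 3.4, "cost": 2.6, "compliance": 4.4,
          "integration": 3.2, "operations": 4.2, "commercial": 2.9},
    "C": {"capability": 3.6, "performance": 3.8, "cost": 3.7, "compliance": 4.6,
          "integration": 3.5, "operations": 3.8, "commercial": 3.3},
    "D": {"capability": 4.4, "performance": 3.5, "cost": 4.2, "compliance": 3.7,
          "integration": 3.9, "operations": 3.6, "commercial": 3.8},
}

def dominates(a, b):
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    return all(a[c] >= b[c] for c in a) and any(a[c] > b[c] for c in a)

front = [n for n, v in vendors.items()
         if not any(dominates(w, v) for m, w in vendors.items() if m != n)]
# front == ['A', 'B', 'C', 'D']: no vendor dominates another, so the single
# number is only a tiebreaker and the real decision is qualitative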