Chapter 01 · Working artifact

The Scorecard

38 weighted criteria across seven categories (capability, performance, cost, compliance, integration, operations, commercial), plus kill switches that auto-disqualify, a worked example across four vendors, and the math that turns it all into a single score.

Scope first — before the scorecard

The scorecard only makes sense once you know what you're scoring against. Pin these answers before writing weights. Disagreement between Product, Compliance, and Finance on any of these is the most common reason scorecards collapse mid-evaluation.

| Decision | Why it matters | Example answer |
|---|---|---|
| What are you verifying? | Consumer IDV, KYB, age, sanctions only, or all of the above | Consumer IDV + sanctions/PEP screening + KYB-lite for partner businesses |
| Under which regulatory regimes? | Different regimes require different evidence and retention | NYDFS BitLicense, FCA, MAS, BaFin, MiCA |
| Which countries by Day 365? | Coverage breadth is a vendor-level constraint | US + EU-27 + UK + SG + AU; not LATAM until 2027 |
| Tiered KYC or single tier? | Determines re-verification economics | Two-tier: light at signup, full at $500 cumulative |
| Expected annual volume? | Tier pricing band; vendor enterprise readiness | 1.2M IDV checks/year, 80K KYB checks/year |
| Acceptable false-reject ceiling? | The business constraint, not a vendor metric | ≤2% false reject on a population matched to ours |
| Build-vs-buy on decision logic? | Determines if you need the vendor's workflow engine | Buy decisioning; we won't build a rules engine year-1 |

Kill switches — anything that auto-disqualifies

A kill switch is a hard floor. Vendors that fail any of these are eliminated before scoring. Be ruthless here; it saves weeks downstream. Customize for your situation — every kill switch should have an owner who'll defend it.

| # | Kill switch | Why | Owner |
|---|---|---|---|
| K1 | No SOC 2 Type II report current within 12 months | Required by every meaningful regulated counterparty; non-negotiable | Security |
| K2 | No GDPR-compliant DPA they'll sign | Cannot lawfully transfer EU PII | Legal / DPO |
| K3 | No EU data residency option (if you serve EU users) | Schrems II + many regulators require it | Legal |
| K4 | No production sandbox we can hit in < 24h | You cannot run a bake-off without one | Engineering |
| K5 | No published uptime SLA, or SLA < 99.5% | IDV outage = signup outage = revenue outage | Engineering / SRE |
| K6 | No documented evidence of regulator acceptance in your top-3 jurisdictions | Letter from a customer, audit report citation, or examiner reference | Compliance / MLRO |
| K7 | Won't share aggregated performance metrics from a comparable customer | Marketing claims aren't measurements | Product |
| K8 | Minimum commitment > 18 months with no exit clause | You'll regret it | Procurement / Finance |
| K9 | Pre-trained biometric models with no documented bias testing | NIST FRVT or equivalent; legal exposure | Legal / Product |
| K10 | No webhook or async-result API (synchronous-only) | Blocks scalable orchestration; forces poll loops | Engineering |

Use kill switches as a filter, not a tiebreaker

Either a vendor passes all kill switches and goes into the scorecard, or they don't and they're out. Don't be tempted to "score them low on K6" instead of cutting them — that's how vendors who should never have made the shortlist end up with a signed contract.
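
In code, the gate is a set comparison, not a score. A minimal sketch, assuming each vendor record lists the switches it passed (the same shape as the JSON skeleton at the end of this chapter); the vendor names here are hypothetical:

REQUIRED = {f"K{i}" for i in range(1, 11)}  # K1–K10; drop K3 if you have no EU users

def shortlist(vendors):
    """Binary gate: pass every kill switch or never reach the scorecard."""
    return [v for v in vendors if REQUIRED <= set(v["kill_switches_passed"])]

candidates = shortlist([
    {"name": "Vendor A", "kill_switches_passed": set(REQUIRED)},
    {"name": "Vendor E", "kill_switches_passed": REQUIRED - {"K6"}},  # no regulator evidence
])
# candidates contains only Vendor A; Vendor E never gets a score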

Weighting the seven categories

Weights are not universal. They reflect your business stage and regulatory posture. A pre-launch fintech weights compliance and capability higher; a scale-stage business with an existing book weights cost and operations higher. Here are three reference weightings as starting points.

| Category | Pre-launch fintech | Scale-stage fintech | Regulated bank / EMI |
|---|---|---|---|
| Capability — coverage, biometrics, KYB | 22% | 15% | 20% |
| Performance — false-reject, completion, latency | 18% | 22% | 15% |
| Cost — per-check, AML hits, minimums, TCO | 10% | 18% | 10% |
| Compliance — certs, residency, regulator acceptance | 20% | 12% | 25% |
| Integration — API quality, SDK platforms, webhooks | 15% | 13% | 10% |
| Operations — SLA, support, dispute mechanism | 8% | 12% | 12% |
| Commercial — flexibility, exit, term | 7% | 8% | 8% |
| Total | 100% | 100% | 100% |

How to set weights without a fight

Run a 60-minute workshop. Each stakeholder (PM, Eng lead, MLRO, CFO delegate, Support lead) gets 100 "tokens" to distribute across the seven categories independently. Average the results, then debate variances of more than 5 points. Write down the rationale in a doc; you'll need it when someone re-litigates in week 8.
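
A minimal sketch of the tally, assuming each stakeholder's 100 tokens arrive as a dict (the allocations below are hypothetical) and interpreting "variances of more than 5 points" as anyone sitting more than 5 points from the room's average on a category:

allocations = {
    "PM":   {"capability": 25, "performance": 20, "cost": 10, "compliance": 15,
             "integration": 15, "operations": 8,  "commercial": 7},
    "Eng":  {"capability": 20, "performance": 20, "cost": 5,  "compliance": 15,
             "integration": 25, "operations": 10, "commercial": 5},
    "MLRO": {"capability": 15, "performance": 10, "cost": 5,  "compliance": 40,
             "integration": 10, "operations": 10, "commercial": 10},
}

categories = list(allocations["PM"])
avg = {c: sum(a[c] for a in allocations.values()) / len(allocations) for c in categories}

for who, a in allocations.items():
    for c in categories:
        if abs(a[c] - avg[c]) > 5:  # the variances worth debating
            print(f"{who} on {c}: {a[c]} vs room average {avg[c]:.0f}")

The printed lines are your workshop agenda; everything else is settled by the average.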

The 38-criterion scorecard

Each criterion is scored 0–5. Within a category, all criteria are weighted equally unless flagged. Mark a criterion (★) to give it 2× weight inside its category.

Scoring rubric (0–5)

| Score | Meaning |
|---|---|
| 5 | Best-in-class. Exceeds requirement with margin; verified independently. |
| 4 | Strong. Meets requirement; minor caveats or vendor-attested only. |
| 3 | Acceptable. Meets minimum. Not differentiating. |
| 2 | Weak. Has gaps. Workaround required. |
| 1 | Poor. Major gap; only viable if other vendors fail too. |
| 0 | Absent. Vendor doesn't offer this at all. |

Capability (7 criteria)

| ID | Criterion | Probe |
|---|---|---|
| C1 ★ | Document coverage by country (top-N markets) | Provide the actual list of document types per country; not "200 countries supported" |
| C2 | Biometric matching accuracy | NIST FRVT 1:1 published score; vendor's published FAR/FRR |
| C3 ★ | Liveness detection (passive + active) | iBeta Level-2 PAD certification; spoofing test results |
| C4 | Sanctions / PEP / adverse-media integration | Native or via partner? Which lists (OFAC, EU, UK, UN, HMT)? Refresh cadence |
| C5 | KYB support (entity formation, UBO, control person) | UBO data quality and corporate-registry depth by jurisdiction |
| C6 | Re-verification & ongoing monitoring | Continuous sanctions screening; re-KYC triggers; cost model |
| C7 | Workflow / decisioning engine | Can you author rules? Versioning? A/B routing? |

Performance (5 criteria)

| ID | Criterion | Probe |
|---|---|---|
| P1 ★ | False-reject rate on a population like yours | Demand breakdowns by country, age cohort, and doc type. ≤2% is a strong number for consumer IDV |
| P2 | False-accept (fraud pass-through) rate | Hardest to measure; ask for a case study with downstream fraud outcome |
| P3 ★ | Completion rate (start → submit → verified) | 85%+ is good for unguided consumer; 65–75% for KYB |
| P4 | Median & p95 time-to-decision | Auto-decisions should be sub-30s; manual queue median < 4h |
| P5 | Mobile-web vs in-app SDK performance gap | The gap matters; vendors hide it |

Cost (5 criteria)

| ID | Criterion | Probe |
|---|---|---|
| $1 ★ | Per-check pricing at your volume tier | Public range: $0.50–$5.00 for consumer IDV; $3–$15 for KYB. Demand a 3-tier curve |
| $2 | AML / PEP / sanctions hit pricing | Is a screening a separate billable event? Per-list? Per-hit investigation? |
| $3 | Re-verification & ongoing-monitoring fees | Often hidden; can be 20–40% of TCO |
| $4 ★ | Minimum commitment / overages / shortfalls | Annual minimum, monthly minimum, overage rate, shortfall penalty |
| $5 | "Other" line items: manual review, dispute, data export | The line-item list itself is the diagnostic |

Compliance (6 criteria)

| ID | Criterion | Probe |
|---|---|---|
| CO1 | SOC 2 Type II | Current report dated within 12 months; review the actual report, not the cert page |
| CO2 | ISO 27001 (and 27701 for privacy) | Certificate copy + scope statement |
| CO3 ★ | GDPR posture + EU data residency | EU-region processing? Sub-processor list? SCCs / IDTA in DPA? |
| CO4 ★ | Regulator acceptance in your top-3 jurisdictions | Named regulated customers; examiner-letter precedent; case-law analogues |
| CO5 | Bias / fairness testing (biometric) | Published test results by demographic; NIST FRVT demographic breakdowns |
| CO6 | Data retention configurability | Per-jurisdiction retention; right-to-erasure mechanism; audit log retention |

Integration (6 criteria)

| ID | Criterion | Probe |
|---|---|---|
| I1 ★ | API quality (REST/gRPC, idempotency, errors) | Read the docs. Look for idempotency keys, retry semantics, error taxonomy |
| I2 | SDK coverage (iOS, Android, Web, RN, Flutter) | Native quality varies; RN/Flutter often regressed |
| I3 ★ | Webhook delivery, signing, replay | HMAC, replay window, dead-letter queue, idempotency |
| I4 | Sandbox quality & test fixtures | Synthetic doc bank? Forced-failure modes? Recorded scenarios? |
| I5 | Time-to-first-verification in sandbox | ≤ 4 engineer-hours is excellent; ≥ 2 days suggests a hostile DX |
| I6 | White-label / branding flexibility | Custom CSS? Translatable strings? Logo placement in liveness flow? |

Operations (5 criteria)

| ID | Criterion | Probe |
|---|---|---|
| O1 ★ | Uptime SLA & credit mechanism | 99.9% on the synchronous API minimum; credits as a % of monthly fees |
| O2 | Support tiers & response times | P1 / P2 / P3 definitions; named contact; Slack / shared channel |
| O3 | Dispute / appeal mechanism for users | Who handles a user who insists they're not a fraudster? SLA? |
| O4 | Incident communication & status page | RCA in < 5 business days; status page with subscription |
| O5 | Customer success / TAM | Named CSM? Quarterly business review? Roadmap influence? |

Commercial (4 criteria)

| ID | Criterion | Probe |
|---|---|---|
| M1 ★ | Contract term flexibility | Month-to-month available? 1-year vs 3-year delta? Auto-renewal terms? |
| M2 | Exit / termination clauses | Termination-for-convenience window; data export at termination; transition assistance |
| M3 | MSA red-flags | Liability cap; indemnification scope; IP assignment on improvements; audit rights |
| M4 | Price-change protection | CPI cap on annual increases; price-renegotiation triggers (e.g., 2× volume, 50% volume drop) |

Worked example — four vendors, pre-launch fintech weights

This is illustrative. Real scores depend on your testing, your jurisdictions, your population. Use this to see how the math behaves, not to learn the answers — and never use a generic scorecard score as a substitute for a sandbox bake-off.

Vendor names are illustrative

The numbers below are reasonable archetypes ("the configurable workflow vendor", "the enterprise incumbent", "the EU-regulated specialist", "the emerging-markets one-stop"), not statements about specific vendors' actual current performance. Run your own bake-off.

| Category (weight) | Vendor A (configurable workflow) | Vendor B (enterprise incumbent) | Vendor C (EU-regulated specialist) | Vendor D (EM one-stop) |
|---|---|---|---|---|
| Capability (22%) | 4.1 | 4.5 | 3.6 | 4.4 |
| Performance (18%) | 4.0 | 3.4 | 3.8 | 3.5 |
| Cost (10%) | 3.5 | 2.6 | 3.7 | 4.2 |
| Compliance (20%) | 3.8 | 4.4 | 4.6 | 3.7 |
| Integration (15%) | 4.6 | 3.2 | 3.5 | 3.9 |
| Operations (8%) | 3.9 | 4.2 | 3.8 | 3.6 |
| Commercial (7%) | 4.0 | 2.9 | 3.3 | 3.8 |
| Weighted total | 4.01 | 3.76 | 3.83 | 3.90 |

Vendor A wins this scorecard narrowly, but the spread (3.76–4.01) is tight enough that a single re-scored row could flip it. That's normal, and it's a feature: it means you've reached the point where qualitative judgment, reference calls, and the bake-off matter more than the scorecard.
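
To see how fragile a narrow win is, re-score one row and recompute. A minimal sketch using the pre-launch weights and the category scores from the table; the re-score itself is hypothetical (say deeper document-coverage testing knocks Vendor A's Capability down):

WEIGHTS = {"capability": 0.22, "performance": 0.18, "cost": 0.10, "compliance": 0.20,
           "integration": 0.15, "operations": 0.08, "commercial": 0.07}

def total(scores):
    return sum(scores[c] * w for c, w in WEIGHTS.items())

A = {"capability": 4.1, "performance": 4.0, "cost": 3.5, "compliance": 3.8,
     "integration": 4.6, "operations": 3.9, "commercial": 4.0}
D = {"capability": 4.4, "performance": 3.5, "cost": 4.2, "compliance": 3.7,
     "integration": 3.9, "operations": 3.6, "commercial": 3.8}

total(A), total(D)               # ≈ 4.01 vs 3.90: Vendor A wins
total({**A, "capability": 3.5})  # ≈ 3.88: one re-scored row, and Vendor D wins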

What the scorecard tells you when it's close

  • Which dimensions are differentiated. Cost (4.2 vs 2.6) and Integration (4.6 vs 3.2) are the dimensions with real spread. Those should drive the negotiation; the sketch after this list computes the spread directly.
  • Where every vendor is mediocre. If everyone scores ~3.5 on Performance, none of them have measured what you need. That's a signal to demand the bake-off be on your data, not theirs.
  • Where the bar is high enough. If everyone is ≥ 4.0 on Compliance, you can stop investing scoring cycles there and reallocate.
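
Spread is cheap to compute. A minimal sketch over the worked-example columns (vendors A–D, in order):

category_scores = {
    "capability":  [4.1, 4.5, 3.6, 4.4],
    "performance": [4.0, 3.4, 3.8, 3.5],
    "cost":        [3.5, 2.6, 3.7, 4.2],
    "compliance":  [3.8, 4.4, 4.6, 3.7],
    "integration": [4.6, 3.2, 3.5, 3.9],
    "operations":  [3.9, 4.2, 3.8, 3.6],
    "commercial":  [4.0, 2.9, 3.3, 3.8],
}
spread = {c: round(max(v) - min(v), 1) for c, v in category_scores.items()}
for c, s in sorted(spread.items(), key=lambda kv: -kv[1]):
    print(c, s)  # cost 1.6, integration 1.4, commercial 1.1, ... operations 0.6

Negotiate where the spread is large; stop spending scoring cycles where it is small.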

Show the math

The scorecard is a weighted mean. Each category score is the arithmetic mean of its 0–5 criterion scores, with starred criteria counting twice. The final score is the weighted mean of the category scores. The result is stable against small re-scorings: a one-point change on any single criterion moves the final score by at most about 0.05 (worst case, a starred criterion in the 22% category: 0.22 × 2⁄9 ≈ 0.049), so typical inter-rater disagreement rarely reorders vendors.

Per-category score

def category_score(criteria):
    """
    criteria: list of (score_0_to_5, weight_1_or_2)
    """
    total_weight = sum(w for _, w in criteria)
    total_score  = sum(s * w for s, w in criteria)
    return total_score / total_weight

# Capability example for Vendor A
capability_A = category_score([
    (4, 2),  # C1 doc coverage (starred = weight 2)
    (4, 1),  # C2 biometric
    (5, 2),  # C3 liveness (starred)
    (4, 1),  # C4 sanctions
    (3, 1),  # C5 KYB
    (4, 1),  # C6 re-verification
    (4, 1),  # C7 workflow
])
# = (8 + 4 + 10 + 4 + 3 + 4 + 4) / 9
# = 37 / 9
# = 4.11

Weighted total

CATEGORY_WEIGHTS = {
    "capability":  0.22,
    "performance": 0.18,
    "cost":        0.10,
    "compliance":  0.20,
    "integration": 0.15,
    "operations":  0.08,
    "commercial":  0.07,
}

def total_score(vendor):
    return sum(
        vendor[cat] * w
        for cat, w in CATEGORY_WEIGHTS.items()
    )

vendor_A = {
    "capability":  4.1, "performance": 4.0, "cost":        3.5,
    "compliance":  3.8, "integration": 4.6, "operations":  3.9,
    "commercial":  4.0,
}
# total_score(vendor_A) ≈ 4.01  (4.014 exactly)
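
The same function reproduces the whole worked-example row of totals. Reusing total_score and vendor_A from above, with the other category scores copied from the table:

vendors = {
    "Vendor A": vendor_A,
    "Vendor B": {"capability": 4.5, "performance": 3.4, "cost": 2.6, "compliance": 4.4,
                 "integration": 3.2, "operations": 4.2, "commercial": 2.9},
    "Vendor C": {"capability": 3.6, "performance": 3.8, "cost": 3.7, "compliance": 4.6,
                 "integration": 3.5, "operations": 3.8, "commercial": 3.3},
    "Vendor D": {"capability": 4.4, "performance": 3.5, "cost": 4.2, "compliance": 3.7,
                 "integration": 3.9, "operations": 3.6, "commercial": 3.8},
}
for name, scores in sorted(vendors.items(), key=lambda kv: -total_score(kv[1])):
    print(f"{name}: {total_score(scores):.2f}")
# Vendor A: 4.01 / Vendor D: 3.90 / Vendor C: 3.83 / Vendor B: 3.76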

Spreadsheet equivalent

For non-coders, the same math in a Google Sheet:

      A            B        C          D          E          F
1     Category     Weight   Vendor A   Vendor B   Vendor C   Vendor D
2     Capability   0.22     4.1        4.5        3.6        4.4
3     Performance  0.18     4.0        3.4        3.8        3.5
…
9     TOTAL                 =SUMPRODUCT($B$2:$B$8,C2:C8)   (drag right)

Export & templates

Copy the JSON skeleton into your tool of choice (Notion DB, Coda, Airtable, sheet). It mirrors the structure above so vendor responses, criterion scores, and weights all live in one schema.

{
  "scorecard_version": "1.0",
  "weights": {
    "capability": 0.22, "performance": 0.18, "cost": 0.10,
    "compliance": 0.20, "integration": 0.15, "operations": 0.08,
    "commercial": 0.07
  },
  "kill_switches": [
    {"id": "K1", "label": "SOC 2 Type II current", "owner": "security"},
    {"id": "K2", "label": "GDPR DPA signed",       "owner": "legal"}
  ],
  "criteria": [
    {"id": "C1", "category": "capability", "label": "Doc coverage by country", "weight_multiplier": 2},
    {"id": "C2", "category": "capability", "label": "Biometric matching",      "weight_multiplier": 1},
    {"id": "C3", "category": "capability", "label": "Liveness (passive + active)", "weight_multiplier": 2}
  ],
  "vendors": [
    {
      "name": "Vendor A",
      "kill_switches_passed": ["K1", "K2"],
      "scores": {"C1": 4, "C2": 4, "C3": 5}
    }
  ]
}
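
And a minimal sketch that consumes the schema end to end, assuming the skeleton above is saved as scorecard.json (a hypothetical filename): it gates on kill switches first, rebuilds category scores from per-criterion scores using weight_multiplier, then takes the weighted total. Weights for categories with no scored criteria are simply unused, so the partial skeleton above still runs:

import json
from collections import defaultdict

with open("scorecard.json") as f:        # the skeleton above, saved to disk
    card = json.load(f)

assert abs(sum(card["weights"].values()) - 1.0) < 1e-9  # weights must sum to 1

switches = {k["id"] for k in card["kill_switches"]}
criteria = {c["id"]: c for c in card["criteria"]}

for vendor in card["vendors"]:
    if switches - set(vendor["kill_switches_passed"]):   # filter, not tiebreaker
        print(f'{vendor["name"]}: eliminated')
        continue
    score_sum, weight_sum = defaultdict(float), defaultdict(float)
    for cid, score in vendor["scores"].items():
        cat = criteria[cid]["category"]
        mult = criteria[cid].get("weight_multiplier", 1)
        score_sum[cat] += score * mult
        weight_sum[cat] += mult
    total = sum(card["weights"][c] * score_sum[c] / weight_sum[c] for c in score_sum)
    print(f'{vendor["name"]}: {total:.2f}')
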
Why per-criterion (not per-category) scoring matters

It's tempting to score directly at the category level — "Vendor A is a 4 on Capability." Don't. Two reasons:

  1. You'll forget the rationale. When stakeholders re-litigate the choice in month 3, "Capability: 4" is unfalsifiable. "C1 doc coverage: 4 because they support 184 of our top 200 countries" is.
  2. Vendors will negotiate at the criterion level. If you say "we scored you 3 on Performance," they have nothing actionable. If you say "we scored you 2 on P1 because your false-reject in our LATAM bake-off was 4.1%," they can either fix it, give you a price concession, or you both know to walk.
Should we use a single number or Pareto-rank vendors?

Both. The weighted score is the headline (good for stakeholder communication). The Pareto view (which vendors are undominated, meaning no rival beats them on all 7 categories) is what you actually use to decide; a minimal sketch follows. If Vendor A is best on 4 categories and Vendor B is best on 3, you have a real choice. If Vendor A is best on 6 of 7, the decision is made.
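
A dominance check over the worked-example category scores, where a vendor dominates another if it is at least as good on every category and strictly better on at least one:

vendors = {
    "A": {"capability": 4.1, "performance": 4.0, "cost": 3.5, "compliance": 3.8,
          "integration": 4.6, "operations": 3.9, "commercial": 4.0},
    "B": {"capability": 4.5, "performance": 3.4, "cost": 2.6, "compliance": 4.4,
          "integration": 3.2, "operations": 4.2, "commercial": 2.9},
    "C": {"capability": 3.6, "performance": 3.8, "cost": 3.7, "compliance": 4.6,
          "integration": 3.5, "operations": 3.8, "commercial": 3.3},
    "D": {"capability": 4.4, "performance": 3.5, "cost": 4.2, "compliance": 3.7,
          "integration": 3.9, "operations": 3.6, "commercial": 3.8},
}

def dominates(a, b):
    """a dominates b: at least as good everywhere, strictly better somewhere."""
    return all(a[c] >= b[c] for c in a) and any(a[c] > b[c] for c in a)

front = [n for n, v in vendors.items()
         if not any(dominates(w, v) for m, w in vendors.items() if m != n)]
# front == ['A', 'B', 'C', 'D']: no vendor dominates another, so the single
# number is only a tiebreaker and the real decision is qualitative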