The Scorecard
38 weighted criteria across capability, performance, cost, compliance, integration, operations, and commercial. With kill switches that auto-disqualify, a worked example across four vendors, and the math that turns it into a single score.
Scope first — before the scorecard
The scorecard only makes sense once you know what you're scoring against. Pin these answers before writing weights. Disagreement between Product, Compliance, and Finance on any of these is the most common reason scorecards collapse mid-evaluation.
| Decision | Why it matters | Example answer |
|---|---|---|
| What are you verifying? | Consumer IDV, KYB, age, sanctions only, or all of the above | Consumer IDV + sanctions/PEP screening + KYB-lite for partner businesses |
| Under which regulatory regimes? | Different regimes require different evidence and retention | NYDFS BitLicense, FCA, MAS, BaFin, MiCA |
| Which countries by Day 365? | Coverage breadth is a vendor-level constraint | US + EU-27 + UK + SG + AU; not LATAM until 2027 |
| Tiered KYC or single tier? | Determines re-verification economics | Two-tier: light at signup, full at $500 cumulative |
| Expected annual volume? | Tier pricing band; vendor enterprise readiness | 1.2M IDV checks/year, 80K KYB checks/year |
| Acceptable false-reject ceiling? | The business constraint, not a vendor metric | ≤2% false reject on a population matched to ours |
| Build-vs-buy on decision logic? | Determines if you need vendor's workflow engine | Buy decisioning; we won't build a rules engine year-1 |
Kill switches — anything that auto-disqualifies
A kill switch is a hard floor. Vendors that fail any of these are eliminated before scoring. Be ruthless here; it saves weeks downstream. Customize for your situation — every kill switch should have an owner who'll defend it.
| # | Kill switch | Why | Owner |
|---|---|---|---|
| K1 | No SOC 2 Type II report current within 12 months | Required by every meaningful regulated counterparty; non-negotiable | Security |
| K2 | No GDPR-compliant DPA they'll sign | Cannot lawfully transfer EU PII | Legal / DPO |
| K3 | No EU data residency option (if you serve EU users) | Schrems II + many regulators require it | Legal |
| K4 | No production sandbox we can hit in < 24h | You cannot run a bake-off without one | Engineering |
| K5 | No published uptime SLA, or SLA < 99.5% | IDV outage = signup outage = revenue outage | Engineering / SRE |
| K6 | No documented evidence of regulator acceptance in your top-3 jurisdictions | Letter from a customer, audit report citation, or examiner reference | Compliance / MLRO |
| K7 | Won't share aggregated performance metrics from a comparable customer | Marketing claims aren't measurements | Product |
| K8 | Minimum commitment > 18 months with no exit clause | You'll regret it | Procurement / Finance |
| K9 | Pre-trained biometric models with no documented bias testing | NIST FRVT or equivalent; legal exposure | Legal / Product |
| K10 | No webhook or async-result API (synchronous-only) | Blocks scalable orchestration; forces poll loops | Engineering |
Either a vendor passes all kill switches and goes into the scorecard, or they don't and they're out. Don't be tempted to "score them low on K6" instead of cutting them — that's how vendors who shouldn't have been on the shortlist end up signing.
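In code the gate is a filter, not a score. A minimal sketch, assuming you record which switches each vendor has actually evidenced (the vendor names and data below are hypothetical):
# Kill-switch gate: a vendor advances to scoring only if every switch passes.
# Hypothetical data; replace with your own K1-K10 evidence.
KILL_SWITCHES = {"K1", "K2", "K3", "K4", "K5", "K6", "K7", "K8", "K9", "K10"}
evidence = {
    "Vendor A": {"K1", "K2", "K3", "K4", "K5", "K6", "K7", "K8", "K9", "K10"},
    "Vendor E": {"K1", "K2", "K4", "K5", "K7", "K8", "K9", "K10"},  # missing K3, K6
}
def passes_gate(passed):
    # Missing or unverified evidence counts as a failure, not a low score.
    return KILL_SWITCHES <= passed
shortlist = [v for v, passed in evidence.items() if passes_gate(passed)]
print(shortlist)  # ['Vendor A'] -- Vendor E is cut before scoring, not "scored low on K3/K6"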
Weighting the seven categories
Weights are not universal. They reflect your business stage and regulatory posture. A pre-launch fintech weights compliance and capability higher; a scale-stage business with an existing book weights cost and operations higher. Here are three reference weightings as starting points.
| Category | Pre-launch fintech | Scale-stage fintech | Regulated bank / EMI |
|---|---|---|---|
| Capability — coverage, biometrics, KYB | 22% | 15% | 20% |
| Performance — false-reject, completion, latency | 18% | 22% | 15% |
| Cost — per-check, AML hits, minimums, total TCO | 10% | 18% | 10% |
| Compliance — certs, residency, regulator acceptance | 20% | 12% | 25% |
| Integration — API quality, SDK platforms, webhooks | 15% | 13% | 10% |
| Operations — SLA, support, dispute mechanism | 8% | 12% | 12% |
| Commercial — flexibility, exit, term | 7% | 8% | 8% |
| Total | 100% | 100% | 100% |
Run a 60-minute workshop. Each stakeholder (PM, Eng lead, MLRO, CFO delegate, Support lead) gets 100 "tokens" to distribute across the seven categories independently. Average the results, then debate variances of more than 5 points. Write down the rationale in a doc; you'll need it when someone re-litigates in week 8.
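A sketch of the averaging step, assuming each stakeholder's allocation is captured as a dict of tokens summing to 100; the names and numbers are hypothetical, and "variances of more than 5 points" is read here as deviation from the group mean:
# Average the token allocations into category weights and flag categories
# where any individual sits more than 5 tokens away from the mean.
from statistics import mean
CATEGORIES = ["capability", "performance", "cost", "compliance",
              "integration", "operations", "commercial"]
allocations = {  # hypothetical workshop inputs, each summing to 100
    "PM":   {"capability": 25, "performance": 20, "cost": 10, "compliance": 15,
             "integration": 15, "operations": 8, "commercial": 7},
    "Eng":  {"capability": 20, "performance": 18, "cost": 8, "compliance": 14,
             "integration": 25, "operations": 10, "commercial": 5},
    "MLRO": {"capability": 18, "performance": 12, "cost": 8, "compliance": 35,
             "integration": 10, "operations": 10, "commercial": 7},
}
avg = {c: mean(a[c] for a in allocations.values()) for c in CATEGORIES}
debate = [c for c in CATEGORIES
          if any(abs(a[c] - avg[c]) > 5 for a in allocations.values())]
weights = {c: round(avg[c] / 100, 2) for c in CATEGORIES}  # tokens -> weights
print(weights)
print("Debate these:", debate)  # ['compliance', 'integration'] -- write down the rationale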
The 38-criterion scorecard
Each criterion is scored 0–5. Within a category, all criteria are weighted equally unless flagged. Mark a criterion (★) to give it 2× weight inside its category.
Scoring rubric (0–5)
| Score | Meaning |
|---|---|
| 5 | Best-in-class. Exceeds requirement with margin; verified independently. |
| 4 | Strong. Meets requirement; minor caveats or vendor-attested only. |
| 3 | Acceptable. Meets minimum. Not differentiating. |
| 2 | Weak. Has gaps. Workaround required. |
| 1 | Poor. Major gap; only viable if other vendors fail too. |
| 0 | Absent. Vendor doesn't offer this at all. |
Capability (7 criteria)
| ID | Criterion | Probe |
|---|---|---|
| C1 ★ | Document coverage by country (top-N markets) | Provide the actual list of document types per country; not "200 countries supported" |
| C2 | Biometric matching accuracy | NIST FRVT 1:1 published score; vendor's published FAR/FRR |
| C3 ★ | Liveness detection (passive + active) | iBeta Level-2 PAD certification; spoofing test results |
| C4 | Sanctions / PEP / adverse-media integration | Native or via partner? Which lists (OFAC, EU, UK, UN, HMT)? Refresh cadence |
| C5 | KYB support (entity formation, UBO, control person) | UBO data coverage and quality by jurisdiction; corporate registry depth |
| C6 | Re-verification & ongoing monitoring | Continuous screening for sanctions; re-KYC triggers; cost model |
| C7 | Workflow / decisioning engine | Can you author rules? Versioning? A/B routing? |
Performance (5 criteria)
| ID | Criterion | Probe |
|---|---|---|
| P1 ★ | False-reject rate on a population like yours | Demand: by country, by age cohort, by doc type. ≤2% is a strong number for consumer IDV |
| P2 | False-accept (fraud pass-through) rate | Hardest to measure; ask for a case study with downstream fraud outcome |
| P3 ★ | Completion rate (start → submit → verified) | 85%+ is good for unguided consumer; 65–75% for KYB |
| P4 | Median & p95 time-to-decision | Auto-decisions should be sub-30s; manual queue median < 4h |
| P5 | Mobile-web vs in-app SDK performance gap | The gap matters; vendors hide it |
Cost (5 criteria)
| ID | Criterion | Probe |
|---|---|---|
| $1 ★ | Per-check pricing at your volume tier | Public range: $0.50–$5.00 for consumer IDV; $3–$15 for KYB. Demand a 3-tier curve |
| $2 | AML / PEP / sanctions hit pricing | Is a screening a separate billable event? Per-list? Per-hit-investigation? |
| $3 | Re-verification & ongoing-monitoring fees | Often hidden; can be 20–40% of TCO |
| $4 ★ | Minimum commitment / overages / shortfalls | Annual minimum, monthly minimum, overage rate, shortfall penalty |
| $5 | "Other" line-items: manual review, dispute, data export | The line-item list itself is the diagnostic (TCO sketch after this table) |
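To see how those line items compound, here is a back-of-envelope annual TCO sketch. Every price, rate, and the re-verification and monitoring volumes are hypothetical placeholders; the real inputs come from the vendor's tier curve ($1) and their full line-item list ($5):
# Rough annual TCO model -- all prices below are hypothetical placeholders.
def annual_tco(idv_checks, kyb_checks, reverif_checks, monitored_users, *,
               idv_price, kyb_price, screening_price, monitoring_price,
               manual_review_rate, manual_review_price, annual_minimum):
    usage = (
        idv_checks * idv_price + kyb_checks * kyb_price           # $1 per-check fees
        + idv_checks * screening_price                            # $2 sanctions/PEP screening
        + reverif_checks * idv_price                              # $3 re-verification
        + monitored_users * monitoring_price                      # $3 ongoing monitoring
        + idv_checks * manual_review_rate * manual_review_price   # $5 manual review
    )
    return max(usage, annual_minimum)                             # $4 shortfall still bills the minimum
# Hypothetical run at the scope-table volumes (1.2M IDV, 80K KYB checks per year)
print(annual_tco(1_200_000, 80_000, 300_000, 900_000,
                 idv_price=1.20, kyb_price=6.00, screening_price=0.25,
                 monitoring_price=0.50, manual_review_rate=0.08,
                 manual_review_price=1.50, annual_minimum=1_500_000))
# ~3.17M/year; re-verification + monitoring alone are ~26% of it, the "often hidden" share flagged in $3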
Compliance (6 criteria)
| ID | Criterion | Probe |
|---|---|---|
| CO1 | SOC 2 Type II | Current report dated within 12 months; review the actual report, not the cert page |
| CO2 | ISO 27001 (and 27701 for privacy) | Certificate copy + scope statement |
| CO3 ★ | GDPR posture + EU data residency | EU-region processing? Sub-processor list? SCCs / IDTA in DPA? |
| CO4 ★ | Regulator acceptance in your top-3 jurisdictions | Named regulated customers; examiner-letter precedent; case-law analogues |
| CO5 | Bias / fairness testing (biometric) | Published test results by demographic; NIST FRVT demographic breakdowns |
| CO6 | Data retention configurability | Per-jurisdiction retention; right-to-erasure mechanism; audit log retention |
Integration (6 criteria)
| ID | Criterion | Probe |
|---|---|---|
| I1 ★ | API quality (REST/gRPC, idempotency, errors) | Read the docs. Look for idempotency keys, retry semantics, error taxonomy |
| I2 | SDK coverage (iOS, Android, Web, RN, Flutter) | Native quality varies; RN/Flutter wrappers often lag the native SDKs |
| I3 ★ | Webhook delivery, signing, replay | HMAC, replay window, dead-letter queue, idempotency (sketch after this table) |
| I4 | Sandbox quality & test fixtures | Synthetic doc bank? Forced-failure modes? Recorded scenarios? |
| I5 | Time-to-first-verification in sandbox | ≤ 4 engineer-hours is excellent; ≥ 2 days suggests a hostile DX |
| I6 | White-label / branding flexibility | Custom CSS? Translatable strings? Logo placement in liveness flow? |
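For I3, the fastest probe is to check what the signing scheme lets you do on the receiving side. A minimal verification sketch, with hypothetical header names and a hypothetical 5-minute replay window; every vendor names and formats these differently:
# Verify an HMAC-signed webhook and reject stale deliveries (replay window).
# Header names, payload layout, and the tolerance are illustrative, not any vendor's actual scheme.
import hashlib
import hmac
import time
REPLAY_WINDOW_SECONDS = 300
def verify_webhook(raw_body, signature_header, timestamp_header, secret):
    # 1. Replay check: the signed timestamp must be recent.
    if abs(time.time() - int(timestamp_header)) > REPLAY_WINDOW_SECONDS:
        return False
    # 2. Recompute the signature over timestamp + body; compare in constant time.
    expected = hmac.new(secret, timestamp_header.encode() + b"." + raw_body,
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)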
Operations (5 criteria)
| ID | Criterion | Probe |
|---|---|---|
| O1 ★ | Uptime SLA & credit mechanism | 99.9% minimum on the synchronous API; credits as a % of monthly fees |
| O2 | Support tiers & response times | P1 / P2 / P3 definitions; named contact; Slack / shared channel |
| O3 | Dispute / appeal mechanism for users | Who handles a user who insists they're not a fraudster? SLA? |
| O4 | Incident communication & status page | RCA in < 5 business days; status page with subscription |
| O5 | Customer success / TAM | Named CSM? Quarterly business review? Roadmap influence? |
Commercial (4 criteria)
| ID | Criterion | Probe |
|---|---|---|
| M1 ★ | Contract term flexibility | Month-to-month available? 1-year vs 3-year delta? Auto-renewal terms? |
| M2 | Exit / termination clauses | Termination-for-convenience window; data export at termination; transition assistance |
| M3 | MSA red-flags | Liability cap; indemnification scope; IP assignment on improvements; audit rights |
| M4 | Price-change protection | CPI cap on annual increases; price-renegotiation triggers (e.g., 2× volume, 50% volume drop) |
Worked example — four vendors, pre-launch fintech weights
This is illustrative. Real scores depend on your testing, your jurisdictions, your population. Use this to see how the math behaves, not to learn the answers — and never use a generic scorecard score as a substitute for a sandbox bake-off.
The numbers below are reasonable archetypes ("the configurable workflow vendor", "the enterprise incumbent", "the EU-regulated specialist", "the emerging-markets one-stop"), not statements about specific vendors' actual current performance. Run your own bake-off.
| Category (weight) | Vendor A (configurable workflow) | Vendor B (enterprise incumbent) | Vendor C (EU-regulated specialist) | Vendor D (EM one-stop) |
|---|---|---|---|---|
| Capability (22%) | 4.1 | 4.5 | 3.6 | 4.4 |
| Performance (18%) | 4.0 | 3.4 | 3.8 | 3.5 |
| Cost (10%) | 3.5 | 2.6 | 3.7 | 4.2 |
| Compliance (20%) | 3.8 | 4.4 | 4.6 | 3.7 |
| Integration (15%) | 4.6 | 3.2 | 3.5 | 3.9 |
| Operations (8%) | 3.9 | 4.2 | 3.8 | 3.6 |
| Commercial (7%) | 4.0 | 2.9 | 3.3 | 3.8 |
| Weighted total | 4.01 | 3.76 | 3.83 | 3.90 |
Vendor A wins this scorecard narrowly, but the spread (3.76–4.01) is tight enough that any single re-scored row could flip it. That's normal and it's a feature — it means you've reached the point where qualitative judgment, reference calls, and the bake-off matter more than the scorecard.
What the scorecard tells you when it's close
- Which dimensions are differentiated. Cost (4.2 vs 2.6) and Integration (4.6 vs 3.2) are the dimensions with real spread. Those should drive the negotiation (a quick spread check is sketched after this list).
- Where every vendor is mediocre. If everyone scores ~3.5 on Performance, none of them have measured what you need. That's a signal to demand the bake-off be on your data, not theirs.
- Where the bar is high enough. If everyone is ≥ 4.0 on Compliance, you can stop investing scoring cycles there and reallocate.
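All three checks fall straight out of the category table. A quick sketch over the worked-example scores; the 1.0-point spread threshold and the 4.0 bar are illustrative choices, not canon:
# Per-category spread across vendors: big spread = negotiate there,
# everyone high = stop scoring it, everyone mediocre = bake-off on your data.
category_scores = {  # from the worked example above (A, B, C, D)
    "capability":  [4.1, 4.5, 3.6, 4.4],
    "performance": [4.0, 3.4, 3.8, 3.5],
    "cost":        [3.5, 2.6, 3.7, 4.2],
    "compliance":  [3.8, 4.4, 4.6, 3.7],
    "integration": [4.6, 3.2, 3.5, 3.9],
    "operations":  [3.9, 4.2, 3.8, 3.6],
    "commercial":  [4.0, 2.9, 3.3, 3.8],
}
for cat, scores in category_scores.items():
    spread = max(scores) - min(scores)
    if spread >= 1.0:
        note = "differentiated -- drive the negotiation here"
    elif min(scores) >= 4.0:
        note = "bar cleared by everyone -- stop investing scoring cycles"
    elif max(scores) < 4.0:
        note = "everyone mediocre -- demand a bake-off on your data"
    else:
        note = ""
    print(f"{cat:12s} spread={spread:.1f}  {note}")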
Show the math
The scorecard is a weighted mean. Each category score is the arithmetic mean of its 0–5 criterion scores (with starred criteria counting twice). The final score is the weighted mean of category scores. Both are stable to ±0.05 against small re-scorings — which is intentionally larger than typical inter-rater variance.
Per-category score
def category_score(criteria):
"""
criteria: list of (score_0_to_5, weight_1_or_2)
"""
total_weight = sum(w for _, w in criteria)
total_score = sum(s * w for s, w in criteria)
return total_score / total_weight
# Capability example for Vendor A
capability_A = category_score([
(4, 2), # C1 doc coverage (starred = weight 2)
(4, 1), # C2 biometric
(5, 2), # C3 liveness (starred)
(4, 1), # C4 sanctions
(3, 1), # C5 KYB
(4, 1), # C6 re-verification
(4, 1), # C7 workflow
])
# = (8 + 4 + 10 + 4 + 3 + 4 + 4) / 9
# = 37 / 9
# = 4.11
Weighted total
CATEGORY_WEIGHTS = {
"capability": 0.22,
"performance": 0.18,
"cost": 0.10,
"compliance": 0.20,
"integration": 0.15,
"operations": 0.08,
"commercial": 0.07,
}
def total_score(vendor):
return sum(
vendor[cat] * w
for cat, w in CATEGORY_WEIGHTS.items()
)
vendor_A = {
"capability": 4.1, "performance": 4.0, "cost": 3.5,
"compliance": 3.8, "integration": 4.6, "operations": 3.9,
"commercial": 4.0,
}
# round(total_score(vendor_A), 2) == 4.01
Spreadsheet equivalent
For non-coders, the same math in a Google Sheet:
A B C D E F
1 Category Weight Vendor A Vendor B Vendor C Vendor D
2 Capability 0.22 4.1 4.5 3.6 4.4
3 Performance 0.18 4.0 3.4 3.8 3.5
…
9 TOTAL =SUMPRODUCT($B$2:$B$8,C2:C8) (drag right)
Export & templates
Copy the JSON skeleton into your tool of choice (Notion DB, Coda, Airtable, sheet). It mirrors the structure above so vendor responses, criterion scores, and weights all live in one schema.
{
"scorecard_version": "1.0",
"weights": {
"capability": 0.22, "performance": 0.18, "cost": 0.10,
"compliance": 0.20, "integration": 0.15, "operations": 0.08,
"commercial": 0.07
},
"kill_switches": [
{"id": "K1", "label": "SOC 2 Type II current", "owner": "security"},
{"id": "K2", "label": "GDPR DPA signed", "owner": "legal"}
],
"criteria": [
{"id": "C1", "category": "capability", "label": "Doc coverage by country", "weight_multiplier": 2},
{"id": "C2", "category": "capability", "label": "Biometric matching", "weight_multiplier": 1},
{"id": "C3", "category": "capability", "label": "Liveness (passive + active)", "weight_multiplier": 2}
],
"vendors": [
{
"name": "Vendor A",
"kill_switches_passed": ["K1", "K2"],
"scores": {"C1": 4, "C2": 4, "C3": 5}
}
]
}
Why per-criterion (not per-category) scoring matters
It's tempting to score directly at the category level — "Vendor A is a 4 on Capability." Don't. Two reasons:
- You'll forget the rationale. When stakeholders re-litigate the choice in month 3, "Capability: 4" is unfalsifiable. "C1 doc coverage: 4 because they support 184 of our top 200 countries" is.
- Vendors will negotiate at the criterion level. If you say "we scored you 3 on Performance," they have nothing actionable. If you say "we scored you 2 on P1 because your false-reject in our LATAM bake-off was 4.1%," they can either fix it, give you a price concession, or you both know to walk.
Should we use a single number or Pareto-rank vendors?
Both. The weighted score is the headline (good for stakeholder communication). The Pareto view, checking whether any vendor dominates another across all 7 categories, is what you actually use to decide. If Vendor A is best on 4 categories and Vendor B is best on 3, you have a real choice. If Vendor A is best on 6 of 7, the decision is made.
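A minimal dominance check over the worked-example category scores; "dominates" here means at least as good on every category and strictly better on at least one:
# Pareto check: X dominates Y if X >= Y on every category and > Y on at least one.
# Non-dominated vendors are the real shortlist; the weighted score is just the headline.
vendors = {  # category scores from the worked example
    "A": {"capability": 4.1, "performance": 4.0, "cost": 3.5, "compliance": 3.8,
          "integration": 4.6, "operations": 3.9, "commercial": 4.0},
    "B": {"capability": 4.5, "performance": 3.4, "cost": 2.6, "compliance": 4.4,
          "integration": 3.2, "operations": 4.2, "commercial": 2.9},
    "C": {"capability": 3.6, "performance": 3.8, "cost": 3.7, "compliance": 4.6,
          "integration": 3.5, "operations": 3.8, "commercial": 3.3},
    "D": {"capability": 4.4, "performance": 3.5, "cost": 4.2, "compliance": 3.7,
          "integration": 3.9, "operations": 3.6, "commercial": 3.8},
}
def dominates(x, y):
    return all(x[c] >= y[c] for c in x) and any(x[c] > y[c] for c in x)
frontier = [v for v in vendors
            if not any(dominates(vendors[w], vendors[v]) for w in vendors if w != v)]
print(frontier)  # ['A', 'B', 'C', 'D'] -- nobody dominates anybody, so judgment and the bake-off decide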