Error Handling & Failure Modes
Payments fail in dozens of distinct ways. Strong PMs recognize each one and reach for "what could go wrong" before "what's cool." Here are the categories that show up in production and in design rounds.
Failure-mode taxonomy
| Failure category | Example | Owner | Recovery |
|---|---|---|---|
| Issuer-side decline | Hard / soft decline codes | Issuer behaviour; PM informs retry policy | Smart retry, rail switch |
| Network outage | Visa, Mastercard, NPCI rail down | Network | Failover routes, customer comms |
| PSP outage | Adyen API errors at elevated rate | Vendor; you have a secondary | Cascade to alternate PSP |
| Acquirer settlement gap | Funds not in nostro on expected date | Acquirer + treasury | Reconcile, escalate |
| 3DS abandonment | User drops out of step-up | UX + issuer flow | Re-offer; alternate rail |
| FX / currency mismatch | Wrong currency credited | Treasury + PSP | Manual reconciliation |
| Instant-rail recall | PIX MED, UPI return | Recipient bank | Hold suspect funds; investigate |
| Duplicate processing | Same txn processed twice | You — idempotency design | Idempotency keys; reverse one |
| Wallet/balance mismatch | Internal ledger out of sync | You — reconciliation | Daily recon job; manual fix |
Soft declines & recovery patterns
Soft declines are recoverable. Hard declines are not, and retrying them can violate card-scheme rules. The shape of recovery:
- Retry on same rail — for timeout, "do not honor" with appropriate spacing.
- Cascade to alternate PSP — same auth attempt; preserves user intent.
- 3DS step-up retry — if soft decline suggests risk; liability shifts.
- Delayed retry — "insufficient funds" — wait, try again. Common for recurring.
- User-facing rail switch — "Your card was declined. Try Open Banking?"
The retry policy is data-driven, not opinion-driven. Run an A/B test on each rule.
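The recovery patterns above can be sketched as a small lookup from decline category to next action. This is a minimal illustration, not a production policy: the category names, delays, attempt cap, and the `next_action` function are all hypothetical, and real values would come from your PSP's decline-code mapping and your own A/B results.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY_SAME_RAIL = "retry_same_rail"
    CASCADE_PSP = "cascade_to_alternate_psp"
    STEP_UP_3DS = "3ds_step_up_retry"
    DELAYED_RETRY = "delayed_retry"
    OFFER_RAIL_SWITCH = "offer_rail_switch"
    STOP = "stop"  # hard decline: never retry


@dataclass(frozen=True)
class Decision:
    action: Action
    delay_seconds: int = 0


# Hypothetical categories and delays, chosen only for illustration.
POLICY = {
    "timeout": Decision(Action.RETRY_SAME_RAIL, delay_seconds=5),
    "do_not_honor": Decision(Action.RETRY_SAME_RAIL, delay_seconds=3600),
    "psp_error": Decision(Action.CASCADE_PSP),
    "suspected_risk": Decision(Action.STEP_UP_3DS),
    "insufficient_funds": Decision(Action.DELAYED_RETRY, delay_seconds=86400),
}
HARD_DECLINES = {"stolen_card", "closed_account"}  # assumed examples
MAX_ATTEMPTS = 3  # assumed cap on automatic attempts


def next_action(decline_category: str, attempt: int) -> Decision:
    """Return what to do after the `attempt`-th failed attempt (1-based)."""
    if decline_category in HARD_DECLINES:
        return Decision(Action.STOP)
    # Out of automatic attempts, or an unmapped category: hand the choice
    # back to the user instead of retrying blindly.
    if attempt >= MAX_ATTEMPTS or decline_category not in POLICY:
        return Decision(Action.OFFER_RAIL_SWITCH)
    return POLICY[decline_category]
```

Keeping the policy in a table rather than in branching code makes each rule easy to vary per experiment arm.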
PSP / partner outage
Real story: it's 03:00 UTC, and your primary acquirer is returning 5xx errors at an elevated rate. Five actions on your runbook:
- Detect: AAR drop, latency spike, 5xx spike. Page in payments-ops.
- Auto-failover: routing layer detects health degradation and shifts traffic to secondary. The "auto" is the senior signal — don't require a human to push a button.
- Comms: update the status page; show an in-app banner for affected geos; page ops via Slack.
- Backfill: capture intents that failed; queue for retry once primary recovers.
- Vendor escalation: open P1; gather logs; SLA-credit conversation in post-mortem.
The hardest part is the long tail: partial degradation, where the success rate drops by, say, 5%, which hurts but is not enough to obviously trigger failover. The PM's job is to set the SLO and the failover threshold.
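One way to make the threshold explicit is a rolling success rate compared against a baseline. The sketch below is an assumption-laden illustration: `PspHealth`, `choose_route`, the window size, and the sample minimum are hypothetical names and values, and a real routing layer would also look at latency and error codes.

```python
from collections import deque
from typing import Optional


class PspHealth:
    """Rolling success rate over the most recent `window` auth attempts."""

    def __init__(self, window: int = 200, min_samples: int = 50):
        self.results = deque(maxlen=window)  # oldest results fall off
        self.min_samples = min_samples

    def record(self, success: bool) -> None:
        self.results.append(success)

    def success_rate(self) -> Optional[float]:
        if len(self.results) < self.min_samples:
            return None  # too little data to judge health
        return sum(self.results) / len(self.results)


def choose_route(primary: PspHealth, baseline: float, max_drop: float) -> str:
    """Shift to the secondary once the rate falls more than `max_drop`
    below the agreed baseline; no human in the loop."""
    rate = primary.success_rate()
    if rate is not None and rate < baseline - max_drop:
        return "secondary"
    return "primary"
```

With a baseline of 0.93 and `max_drop` of 0.02, a primary running at 0.90 is routed away even though nothing is fully down, which is the partial-degradation case described above.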
Mid-flight 3DS abandonment
Customer enters card, gets bounced to issuer 3DS challenge, never completes. From your data, it looks like a successful auth attempt that "expired." Treat carefully:
- Distinguish 3DS-initiated-no-response from 3DS-failed.
- Reach out: in-app reminder "Your bank requires confirmation, finish here."
- Offer an alternative rail explicitly: "Or pay via PIX / Open Banking."
- Make sure expired 3DS sessions don't double-charge if user retries.
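The first distinction in the list, no response versus an explicit failure, can be sketched as a small classifier. The outcome names, the `result` strings, and the 10-minute timeout are assumptions for illustration; your PSP's 3DS status values and session lifetime will differ.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class ThreeDSOutcome(Enum):
    AUTHENTICATED = "authenticated"
    FAILED = "failed"        # issuer returned an explicit failure
    ABANDONED = "abandoned"  # challenge started, no response before timeout
    PENDING = "pending"      # still inside the challenge window


def classify_3ds(
    started_at: datetime,
    result: Optional[str],
    now: datetime,
    timeout: timedelta = timedelta(minutes=10),  # assumed session lifetime
) -> ThreeDSOutcome:
    if result == "authenticated":
        return ThreeDSOutcome.AUTHENTICATED
    if result == "failed":
        return ThreeDSOutcome.FAILED
    # No result yet: the session is either still open or was abandoned.
    if now - started_at >= timeout:
        return ThreeDSOutcome.ABANDONED
    return ThreeDSOutcome.PENDING
```

Only `ABANDONED` sessions would get the reminder and the alternative-rail offer; `FAILED` sessions are a decline and follow the retry policy instead.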
FX & currency mismatch
The expensive failure mode: customer pays BRL 500, your PSP credits you USD at a stale rate, your ledger credits the customer crypto at a different rate. Net result: a few cents off per txn — but at scale, a structural P&L hole.
- Lock the quote, honor it.
- Reconcile daily — what FX rate did the PSP actually apply, vs what you quoted? Net the difference into a treasury account.
- If you're consistently losing, renegotiate spread or change FX provider.
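The daily reconciliation step reduces to simple arithmetic: for each transaction, compare the USD you quoted with the USD the PSP actually delivered. The sketch below uses made-up rates and hypothetical function names, and assumes rates are expressed as USD per 1 BRL.

```python
from decimal import Decimal
from typing import Iterable, Tuple

CENT = Decimal("0.01")


def fx_slippage_usd(
    amount_brl: Decimal, quoted_rate: Decimal, applied_rate: Decimal
) -> Decimal:
    """USD actually received minus USD quoted. Negative means a loss."""
    return (amount_brl * (applied_rate - quoted_rate)).quantize(CENT)


def daily_slippage_usd(
    txns: Iterable[Tuple[Decimal, Decimal, Decimal]]
) -> Decimal:
    """Net slippage across (amount_brl, quoted_rate, applied_rate) rows;
    this is the figure to net into the treasury account."""
    return sum((fx_slippage_usd(*t) for t in txns), Decimal("0"))


# Illustrative rates only, not market data.
txns = [
    (Decimal("500"), Decimal("0.2000"), Decimal("0.1999")),   # 5 cents lost
    (Decimal("1000"), Decimal("0.2000"), Decimal("0.2000")),  # quote honored
    (Decimal("250"), Decimal("0.2000"), Decimal("0.1996")),   # 10 cents lost
]
print(daily_slippage_usd(txns))  # -0.15
```

`Decimal` is used rather than `float` so that cent-level differences are not blurred by binary rounding.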
Deposit-withdraw fraud loop
The canonical crypto-payments fraud: the attacker deposits via card, immediately withdraws to crypto, then disputes the card payment and triggers a chargeback. The funds are gone, and the exchange eats the loss.
PM defenses:
- Velocity rules: high-amount deposit + immediate crypto withdrawal = elevated risk score.
- Withdrawal hold periods on card deposits (24-72h common).
- Enforce 3DS on the first card deposit (liability shift).
- Deposit-method-aware withdrawal restrictions (deposit via card → withdraw to same card only, for X days).
- Issuer/BIN-level risk scoring (some BINs run hot for friendly fraud).
Tradeoff: every defense hurts conversion for the 99.x% of legitimate users. The PM tunes the threshold.
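As a minimal sketch of the hold-period defense, the check below blocks crypto withdrawals funded by a recent card deposit. The 48-hour value is an assumption picked from inside the 24-72h range above, and the function and method names are hypothetical.

```python
from datetime import datetime, timedelta

CARD_HOLD = timedelta(hours=48)  # assumed; this is the threshold the PM tunes


def crypto_withdrawal_allowed(
    deposit_method: str,
    deposited_at: datetime,
    now: datetime,
    hold: timedelta = CARD_HOLD,
) -> bool:
    """Card-funded balances must age past the hold before leaving as crypto.
    Other deposit methods are not held in this simplified sketch."""
    if deposit_method != "card":
        return True
    return now - deposited_at >= hold
```

Shortening `hold` recovers conversion for legitimate users and widens the window for the deposit-withdraw loop, which is the tradeoff described above.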
Instant-rail recalls — PIX MED, UPI return
Instant rails are almost irreversible. The "almost" matters.
| Rail | Recall mechanism | Window |
|---|---|---|
| PIX | MED (fraud return), MEC (operational error) | MED: 80 days to file; recipient bank has 7 days to hold |
| UPI | NPCI dispute mechanism via PSP | Varies; typically tight (days) |
| SEPA Instant | Recall request; recipient cooperation required | Short; recipient may decline |
| FPS | Confirmation of Payee + APP fraud rules (UK) | Reimbursement regime mandates response within set windows |
PM implication: a "settled" txn isn't truly final. Build a hold-window UX for inbound instant-rail receipts that match high-risk patterns, and communicate the hold clearly to the customer.
Idempotency & replay
Every payment API call must accept an idempotency key. If the same key arrives twice (because the client retried after a timeout), the system returns the original response, not a second charge. Without this, duplicate-charge failures happen at scale every day. PM rule: idempotency is a launch blocker, not a nice-to-have.
```sql
-- Sanity check: any payment intents with multiple successful auths?
SELECT
    payment_intent_id,
    COUNT(*) FILTER (WHERE status = 'approved') AS approved_count
FROM auth_attempts
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY payment_intent_id
HAVING COUNT(*) FILTER (WHERE status = 'approved') > 1;
-- Expectation: zero rows. Any row is a duplicate-charge incident.
```
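The idempotency rule itself can be sketched in a few lines. This in-memory version is only an illustration under stated assumptions: `IdempotentCharger` and `charge_fn` are hypothetical names, and a real service would persist keys in a database with a uniqueness constraint rather than in a process-local dict.

```python
import threading
from typing import Callable, Dict


class IdempotentCharger:
    """Returns the stored response when a key is seen again, so a client
    retry after a timeout never produces a second charge."""

    def __init__(self, charge_fn: Callable[[int], dict]):
        self._charge_fn = charge_fn  # hypothetical call into the PSP
        self._responses: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def charge(self, idempotency_key: str, amount_minor: int) -> dict:
        # Holding the lock across the charge serializes all requests.
        # That is acceptable for a sketch, too coarse for production.
        with self._lock:
            if idempotency_key in self._responses:
                return self._responses[idempotency_key]
            response = self._charge_fn(amount_minor)
            self._responses[idempotency_key] = response
            return response
```

A retry with the same key gets back the original response object; only a new key reaches `charge_fn`.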
A general error-handling playbook
Walk this through in a system-design round when the interviewer says "what about failures?":
- Classify the failure: where in the chain, who owns it, transient vs persistent.
- Decide reversibility: can you undo? Or do you need to compensate?
- Detect early: SLO breach triggers automation, not a human.
- Failover with idempotency: retry safely; never double-charge.
- Communicate honestly: customer-facing copy, status page, internal pager.
- Reconcile: at end of day, match expectations to reality; surface deltas.
- Post-mortem: blameless; commit to one structural fix and one detection improvement.
Strong candidates land on this skeleton naturally. Memorize it.