Section B · Technical core

Error Handling & Failure Modes

Payments fail in dozens of distinct ways. Strong PMs recognize each one and reach for "what could go wrong" before "what's cool." Here are the categories that show up in production and in design rounds.

Failure-mode taxonomy

Failure category	Example	Owner	Recovery
Issuer-side decline	Hard / soft decline codes	Issuer behaviour; PM informs retry policy	Smart retry, rail switch
Network outage	Visa, Mastercard, NPCI rail down	Network	Failover routes, customer comms
PSP outage	Adyen API errors at elevated rate	Vendor; you have a secondary	Cascade to alternate PSP
Acquirer settlement gap	Funds not in nostro on expected date	Acquirer + treasury	Reconcile, escalate
3DS abandonment	User drops out of step-up	UX + issuer flow	Re-offer; alternate rail
FX / currency mismatch	Wrong currency credited	Treasury + PSP	Manual reconciliation
Instant-rail recall	PIX MED, UPI return	Recipient bank	Hold suspect funds; investigate
Duplicate processing	Same txn processed twice	You — idempotency design	Idempotency keys; reverse one
Wallet/balance mismatch	Internal ledger out of sync	You — reconciliation	Daily recon job; manual fix

Soft declines & recovery patterns

Soft declines are recoverable. Hard declines are not: retrying them violates scheme rules. The shape of recovery:

Retry on same rail: for timeouts and "do not honor," with appropriate spacing between attempts.
Cascade to alternate PSP: same auth attempt, different processor; preserves user intent.
3DS step-up retry: if the soft decline suggests risk; liability shifts.
Delayed retry: for "insufficient funds," wait, then try again. Common for recurring payments.
User-facing rail switch: "Your card was declined. Try Open Banking?"

The retry policy is data-driven, not opinion-driven. Run an A/B on each rule.
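The recovery patterns above can be sketched as a lookup table from decline reason to next action. This is a minimal sketch, not a scheme-approved rulebook: the reason strings, delays, `MAX_ATTEMPTS`, and the `next_action` helper are all hypothetical, and real decline codes vary by network and PSP. The delays are exactly the kind of values you would A/B test.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY_SAME_RAIL = "retry_same_rail"
    CASCADE_PSP = "cascade_to_alternate_psp"
    STEP_UP_3DS = "3ds_step_up_retry"
    DELAYED_RETRY = "delayed_retry"
    OFFER_RAIL_SWITCH = "offer_rail_switch"
    STOP = "stop"  # never retry this card automatically


@dataclass(frozen=True)
class Decision:
    action: Action
    delay_seconds: int = 0


# Hypothetical reasons and delays (assumptions for illustration only).
SOFT_DECLINE_POLICY = {
    "timeout": Decision(Action.RETRY_SAME_RAIL, delay_seconds=5),
    "do_not_honor": Decision(Action.RETRY_SAME_RAIL, delay_seconds=3600),
    "psp_error": Decision(Action.CASCADE_PSP),
    "authentication_required": Decision(Action.STEP_UP_3DS),
    "insufficient_funds": Decision(Action.DELAYED_RETRY, delay_seconds=3 * 24 * 3600),
}
HARD_DECLINES = {"stolen_card", "invalid_account", "pickup_card"}
MAX_ATTEMPTS = 3


def next_action(reason: str, attempts_so_far: int) -> Decision:
    """Pick the next recovery step for a declined auth attempt."""
    if reason in HARD_DECLINES:
        return Decision(Action.STOP)
    if attempts_so_far >= MAX_ATTEMPTS:
        # Automated retries are exhausted: hand the choice back to the user.
        return Decision(Action.OFFER_RAIL_SWITCH)
    # Unknown reasons fall back to the user-facing rail switch.
    return SOFT_DECLINE_POLICY.get(reason, Decision(Action.OFFER_RAIL_SWITCH))
```

Keeping the policy as data rather than branching logic makes each rule easy to toggle per experiment arm.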

PSP / partner outage

Real story: it's 03:00 UTC, your primary acquirer has elevated 5xx rates. Five actions on your runbook:

Detect: AAR drop, latency spike, 5xx spike. Page in payments-ops.
Auto-failover: routing layer detects health degradation and shifts traffic to secondary. The "auto" is the senior signal — don't require a human to push a button.
Comms: update the status page, show an in-app banner in affected geos, and notify ops via Slack and pager.
Backfill: capture intents that failed; queue for retry once primary recovers.
Vendor escalation: open P1; gather logs; SLA-credit conversation in post-mortem.

The hardest part is the long tail: partial degradation where the success rate drops 5%, which is not enough to trigger failover on its own. The PM's job is to set the SLO and the failover threshold.
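One way to make the threshold discussion concrete is a rolling success-rate monitor with two bands: a partial shift for mild degradation and full failover for a clear outage. This is a sketch under assumptions: `PspHealth`, `primary_traffic_share`, the window size, and the 3-point and 10-point drop thresholds are all hypothetical values a PM would tune against the SLO.

```python
from collections import deque


class PspHealth:
    """Rolling success rate over the last `window` auth attempts."""

    def __init__(self, window: int = 200):
        self.outcomes = deque(maxlen=window)

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def success_rate(self) -> float:
        if not self.outcomes:
            return 1.0  # no data yet: assume healthy
        return sum(self.outcomes) / len(self.outcomes)


def primary_traffic_share(rate: float, baseline: float,
                          degrade_drop: float = 0.03,
                          failover_drop: float = 0.10) -> float:
    """Fraction of traffic to keep on the primary PSP.

    `baseline` is the normal success rate; both drop thresholds are
    assumed values, expressed as absolute drops from that baseline.
    """
    drop = baseline - rate
    if drop >= failover_drop:
        return 0.0  # clear outage: move everything to the secondary
    if drop >= degrade_drop:
        return 0.5  # partial degradation: split traffic and keep measuring
    return 1.0
```

With a 90% baseline, a 5-point drop lands in the middle band: half the traffic moves without waiting for a human, and the primary still receives enough volume to show when it has recovered.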

Mid-flight 3DS abandonment

Customer enters card, gets bounced to issuer 3DS challenge, never completes. From your data, it looks like an auth attempt that was initiated successfully and then "expired." Treat carefully:

Distinguish 3DS-initiated-no-response from 3DS-failed.
Reach out: in-app reminder "Your bank requires confirmation, finish here."
Offer an alternative rail explicitly: "Or pay via PIX / Open Banking."
Make sure expired 3DS sessions don't double-charge if user retries.
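The first distinction in the list above (initiated-no-response versus failed) can be sketched as a small classifier over the session record. The field names, the `"success"` result string, and the 15-minute `SESSION_TTL` are assumptions for illustration; real 3DS result values depend on your PSP's API.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class ThreeDSOutcome(Enum):
    PENDING = "pending"              # challenge still open
    ABANDONED = "abandoned"          # initiated, no response before expiry
    FAILED = "failed"                # issuer returned an explicit failure
    AUTHENTICATED = "authenticated"


SESSION_TTL = timedelta(minutes=15)  # assumed session lifetime


def classify_3ds(initiated_at: datetime,
                 issuer_result: Optional[str],
                 now: datetime) -> ThreeDSOutcome:
    """Separate 'user never answered' from 'issuer said no'."""
    if issuer_result == "success":
        return ThreeDSOutcome.AUTHENTICATED
    if issuer_result is not None:
        return ThreeDSOutcome.FAILED
    if now - initiated_at >= SESSION_TTL:
        return ThreeDSOutcome.ABANDONED
    return ThreeDSOutcome.PENDING
```

Only the abandoned bucket is a candidate for the in-app reminder. To avoid double charges, a retry should reuse the original payment intent rather than create a new one.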

FX & currency mismatch

The expensive failure mode: customer pays BRL 500, your PSP credits you USD at a stale rate, your ledger credits the customer crypto at a different rate. Net result: a few cents off per txn — but at scale, a structural P&L hole.

Lock the quote, honor it.
Reconcile daily — what FX rate did the PSP actually apply, vs what you quoted? Net the difference into a treasury account.
If you're consistently losing, renegotiate spread or change FX provider.
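A worked example of the daily reconciliation arithmetic, using made-up numbers: the rates (USD per BRL) and the 100,000 transactions per day are purely illustrative, and `fx_slippage_usd` is a hypothetical helper. `Decimal` avoids floating-point drift in money math.

```python
from decimal import Decimal


def fx_slippage_usd(amount_brl: Decimal,
                    quoted_rate: Decimal,
                    applied_rate: Decimal) -> Decimal:
    """USD actually received minus USD implied by the customer quote.

    Rates are USD per BRL. A negative result is a loss to net into treasury.
    """
    quoted_usd = amount_brl * quoted_rate
    applied_usd = amount_brl * applied_rate
    return (applied_usd - quoted_usd).quantize(Decimal("0.01"))


# Illustrative numbers only: BRL 500 quoted at 0.2000, settled at 0.1999.
per_txn = fx_slippage_usd(Decimal("500"), Decimal("0.2000"), Decimal("0.1999"))
daily = per_txn * 100_000  # assumed daily transaction count
print(per_txn, daily)  # -0.05 -5000.00
```

Five cents per transaction looks like noise; at the assumed volume it is USD 5,000 a day, which is the structural hole the reconciliation job exists to surface.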

Deposit-withdraw fraud loop

The canonical crypto-payments fraud: attacker deposits via card, immediately withdraws to crypto, then disputes the card payment to force a chargeback. The funds are gone and the exchange eats the loss.

PM defenses:

Velocity rules: high-amount deposit + immediate crypto withdrawal = elevated risk score.
Withdrawal hold periods on card deposits (24-72h common).
3DS enforced on the first card deposit (liability shift).
Deposit-method-aware withdrawal restrictions (deposit via card → withdraw to same card only, for X days).
Issuer/BIN-level risk scoring (some BINs run hot for friendly fraud).

Tradeoff: every defense hurts conversion for the 99.x% of legitimate users. The PM tunes the threshold.
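A minimal sketch of how a few of these defenses combine into one withdrawal gate. Everything here is an assumption for illustration: the `Deposit` fields, the 48h and 72h holds (picked from the 24-72h range above), and the USD 1,000 high-amount cutoff. These constants are the thresholds the PM tunes against conversion.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Deposit:
    method: str            # "card", "bank_transfer", ...
    amount_usd: float
    created_at: datetime
    three_ds_passed: bool
    bin_is_hot: bool       # BIN flagged for elevated friendly fraud


BASE_CARD_HOLD = timedelta(hours=48)    # assumed default hold
RISKY_CARD_HOLD = timedelta(hours=72)   # assumed hold for risky deposits
HIGH_AMOUNT_USD = 1_000                 # assumed cutoff


def crypto_withdrawal_allowed(dep: Deposit, now: datetime) -> bool:
    """Gate crypto withdrawals funded by a recent card deposit."""
    if dep.method != "card":
        return True  # this sketch only restricts card-funded balances
    risky = (dep.amount_usd >= HIGH_AMOUNT_USD
             or dep.bin_is_hot
             or not dep.three_ds_passed)
    hold = RISKY_CARD_HOLD if risky else BASE_CARD_HOLD
    return now - dep.created_at >= hold
```

Low-risk card deposits clear after the base hold; any risk signal extends it. Moving the cutoff or either hold shifts the balance between fraud loss and conversion.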

Instant-rail recalls — PIX MED, UPI return

Instant rails are almost irreversible. The "almost" matters.

Rail	Recall mechanism	Window
PIX	MED (fraud return), MEC (operational error)	MED: 80 days to file; recipient bank has 7 days to hold
UPI	NPCI dispute mechanism via PSP	Varies; typically tight (days)
SEPA Instant	Recall request; recipient cooperation required	Short; recipient may decline
FPS	Confirmation of Payee + APP fraud rules (UK)	Reimbursement regime mandates response within set windows

PM implication: a "settled" txn isn't truly final. Build a hold-window UX for inbound instant-rail receipts that match high-risk patterns. Communicate clearly to the customer.

Idempotency & replay

Every payment API call must accept an idempotency key. If the same key arrives twice (because the client retried after a timeout), the system returns the original response, not a second charge. Without this, duplicate-charge failures happen at scale every day. PM rule: idempotency is a launch blocker, not a nice-to-have.

-- Sanity check: any payment intents with multiple successful auths?
SELECT
  payment_intent_id,
  COUNT(*) FILTER (WHERE status='approved') AS approved_count
FROM auth_attempts
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY payment_intent_id
HAVING COUNT(*) FILTER (WHERE status='approved') > 1;
-- Expectation: zero rows. Any row is a duplicate-charge incident.
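The replay behavior described above can be sketched with an in-memory store keyed by idempotency key. This is a single-process illustration, not a production design: `IdempotentCharges` and `charge_fn` are hypothetical names, and a real system would persist keys in a durable store with a uniqueness constraint and also check that a replayed request carries the same payload.

```python
import threading
from typing import Callable, Dict


class IdempotentCharges:
    """Return the original response when an idempotency key is replayed."""

    def __init__(self, charge_fn: Callable[[int], dict]):
        self._charge_fn = charge_fn          # performs the real charge
        self._responses: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # The lock makes check-then-charge atomic within this process,
        # so two concurrent retries cannot both reach charge_fn.
        with self._lock:
            if idempotency_key in self._responses:
                return self._responses[idempotency_key]
            response = self._charge_fn(amount_cents)
            self._responses[idempotency_key] = response
            return response


# Usage: a client retry after a timeout reuses the same key.
calls = []

def fake_charge(amount_cents: int) -> dict:
    calls.append(amount_cents)
    return {"charge_id": len(calls), "amount_cents": amount_cents}

service = IdempotentCharges(fake_charge)
first = service.charge("key-123", 500)
retry = service.charge("key-123", 500)
print(first == retry, len(calls))  # True 1
```

The retry gets the stored response and the card is charged once, so the query above would return zero rows for this intent.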

A general error-handling playbook

Walk this through in a system-design round when the interviewer says "what about failures?":

Classify the failure: where in the chain, who owns it, transient vs persistent.
Decide reversibility: can you undo? Or do you need to compensate?
Detect early: SLO breach triggers automation, not a human.
Failover with idempotency: retry safely; never double-charge.
Communicate honestly: customer-facing copy, status page, internal pager.
Reconcile: at end of day, match expectations to reality; surface deltas.
Post-mortem: blameless; commit to one structural fix and one detection improvement.

Strong candidates land on this skeleton naturally. Memorize it.