Section B · Technical core

Error Handling & Failure Modes

Payments fail in dozens of distinct ways. Strong PMs recognize each one and reach for "what could go wrong" before "what's cool." Here are the categories that show up in production and in design rounds.

Failure-mode taxonomy

Failure category | Example | Owner | Recovery
Issuer-side decline | Hard / soft decline codes | Issuer behaviour; PM informs retry policy | Smart retry, rail switch
Network outage | Visa, Mastercard, NPCI rail down | Network | Failover routes, customer comms
PSP outage | Adyen API errors at elevated rate | Vendor; you have a secondary | Cascade to alternate PSP
Acquirer settlement gap | Funds not in nostro on expected date | Acquirer + treasury | Reconcile, escalate
3DS abandonment | User drops out of step-up | UX + issuer flow | Re-offer; alternate rail
FX / currency mismatch | Wrong currency credited | Treasury + PSP | Manual reconciliation
Instant-rail recall | PIX MED, UPI return | Recipient bank | Hold suspect funds; investigate
Duplicate processing | Same txn processed twice | You (idempotency design) | Idempotency keys; reverse one
Wallet/balance mismatch | Internal ledger out of sync | You (reconciliation) | Daily recon job; manual fix

Soft declines & recovery patterns

Soft declines are recoverable. Retrying a hard decline violates scheme rules. The shape of recovery:

  • Retry on same rail — for timeout, "do not honor" with appropriate spacing.
  • Cascade to alternate PSP — same auth attempt; preserves user intent.
  • 3DS step-up retry — if soft decline suggests risk; liability shifts.
  • Delayed retry — "insufficient funds" — wait, try again. Common for recurring.
  • User-facing rail switch — "Your card was declined. Try Open Banking?"

The retry policy is data-driven, not opinion-driven. Run an A/B on each rule.
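
One way to make the recovery patterns above concrete is a lookup table from decline reason to next action. This is a minimal sketch, not a production policy: the reason strings, delays, attempt cap, and the `decide` helper are all hypothetical, and real PSPs each return their own decline codes that you would need to normalize first.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum
from typing import Optional


class Action(Enum):
    RETRY_SAME_RAIL = "retry_same_rail"
    CASCADE_PSP = "cascade_psp"
    STEP_UP_3DS = "step_up_3ds"
    DELAYED_RETRY = "delayed_retry"
    OFFER_RAIL_SWITCH = "offer_rail_switch"


@dataclass(frozen=True)
class RetryDecision:
    action: Action
    delay: Optional[timedelta] = None


# Hypothetical normalized reasons and delays; tune each rule from A/B data.
POLICY = {
    "timeout": RetryDecision(Action.RETRY_SAME_RAIL, timedelta(seconds=2)),
    "do_not_honor": RetryDecision(Action.RETRY_SAME_RAIL, timedelta(minutes=10)),
    "psp_error": RetryDecision(Action.CASCADE_PSP),
    "suspected_risk": RetryDecision(Action.STEP_UP_3DS),
    "insufficient_funds": RetryDecision(Action.DELAYED_RETRY, timedelta(days=3)),
}

# Hard declines are never retried on the card (illustrative set).
HARD_DECLINES = {"stolen_card", "closed_account"}

MAX_ATTEMPTS = 3  # illustrative cap on attempts per payment intent


def decide(reason: str, attempts_so_far: int) -> RetryDecision:
    """Pick the next recovery action for a declined payment."""
    if reason in HARD_DECLINES or attempts_so_far >= MAX_ATTEMPTS:
        return RetryDecision(Action.OFFER_RAIL_SWITCH)
    # Unknown reasons fall back to the user-facing rail switch.
    return POLICY.get(reason, RetryDecision(Action.OFFER_RAIL_SWITCH))
```

Keeping the policy as data rather than branching code means each rule can be changed, or split into A/B arms, without touching the decision function.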

PSP / partner outage

Real story: it's 03:00 UTC, your primary acquirer has elevated 5xx rates. Five actions on your runbook:

  1. Detect: AAR drop, latency spike, 5xx spike. Page in payments-ops.
  2. Auto-failover: routing layer detects health degradation and shifts traffic to secondary. The "auto" is the senior signal — don't require a human to push a button.
  3. Comms: status page; in-app banner for affected geos; notify ops via Slack and pager.
  4. Backfill: capture intents that failed; queue for retry once primary recovers.
  5. Vendor escalation: open P1; gather logs; SLA-credit conversation in post-mortem.

The hardest part is the long tail: partial degradation where the success rate drops 5% but not enough to trip an obvious failover trigger. PM job: set the SLO and the threshold.
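
The threshold decision can be sketched as a rolling success-rate window compared against a baseline. Everything here is an assumption for illustration: the window size, the `min_samples` guard, the baseline value, and the `choose_psp` name. A real routing layer would also weigh latency and 5xx rates.

```python
from collections import deque


class HealthWindow:
    """Rolling success rate over the most recent auth attempts for one PSP."""

    def __init__(self, size: int = 200) -> None:
        self._outcomes = deque(maxlen=size)

    def record(self, success: bool) -> None:
        self._outcomes.append(success)

    def sample_count(self) -> int:
        return len(self._outcomes)

    def success_rate(self) -> float:
        if not self._outcomes:
            return 1.0  # no data yet: treat as healthy
        return sum(self._outcomes) / len(self._outcomes)


def choose_psp(primary: HealthWindow, baseline: float, max_drop: float,
               min_samples: int = 50) -> str:
    """Route to the secondary when the primary falls too far below baseline."""
    if primary.sample_count() < min_samples:
        return "primary"  # too few samples to act on
    if baseline - primary.success_rate() > max_drop:
        return "secondary"
    return "primary"
```

The `max_drop` parameter is the knob the PM owns: set it too tight and normal noise flaps traffic between PSPs, too loose and a partial degradation never triggers failover.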

Mid-flight 3DS abandonment

Customer enters card, gets bounced to issuer 3DS challenge, never completes. From your data, it looks like a successful auth attempt that "expired." Treat carefully:

  • Distinguish 3DS-initiated-no-response from 3DS-failed.
  • Reach out: in-app reminder "Your bank requires confirmation, finish here."
  • Offer an alternative rail explicitly: "Or pay via PIX / Open Banking."
  • Make sure expired 3DS sessions don't double-charge if user retries.
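
The distinctions in the list above can be sketched as a small state classifier. The state names, the result strings, and the 15-minute session lifetime are assumptions for illustration; check the actual session timeout and result values your PSP reports.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

# Assumed challenge lifetime; not taken from any scheme or PSP specification.
CHALLENGE_TTL = timedelta(minutes=15)


class ChallengeState(Enum):
    COMPLETED = "completed"
    FAILED = "failed"        # issuer returned an explicit failure
    PENDING = "pending"      # started, no response yet, still within the TTL
    ABANDONED = "abandoned"  # started, no response, TTL elapsed


def classify_challenge(started_at: datetime, result: Optional[str],
                       now: datetime) -> ChallengeState:
    """Separate 3DS-failed from 3DS-initiated-no-response."""
    if result == "authenticated":
        return ChallengeState.COMPLETED
    if result == "failed":
        return ChallengeState.FAILED
    if now - started_at >= CHALLENGE_TTL:
        return ChallengeState.ABANDONED
    return ChallengeState.PENDING


def may_start_new_attempt(state: ChallengeState) -> bool:
    """Allow a retry only once the earlier session can no longer complete."""
    return state in (ChallengeState.FAILED, ChallengeState.ABANDONED)
```

Blocking new attempts while a session is still pending is what keeps a late-completing challenge and a user retry from both resulting in a charge.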

FX & currency mismatch

The expensive failure mode: customer pays BRL 500, your PSP credits you USD at a stale rate, your ledger credits the customer crypto at a different rate. Net result: a few cents off per txn — but at scale, a structural P&L hole.

  • Lock the quote, honor it.
  • Reconcile daily — what FX rate did the PSP actually apply, vs what you quoted? Net the difference into a treasury account.
  • If you're consistently losing, renegotiate spread or change FX provider.
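
To see how a small rate gap compounds, here is a sketch of the daily reconciliation arithmetic. The rates and volumes are invented for illustration, and `fx_slippage` and `daily_net` are hypothetical helper names. `Decimal` is used so that cents do not drift through float rounding.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Iterable, Tuple

CENT = Decimal("0.01")


def fx_slippage(amount_src: Decimal, quoted_rate: Decimal,
                applied_rate: Decimal) -> Decimal:
    """Received minus quoted, in the target currency. Negative means a loss."""
    quoted = (amount_src * quoted_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    received = (amount_src * applied_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    return received - quoted


def daily_net(txns: Iterable[Tuple[Decimal, Decimal, Decimal]]) -> Decimal:
    """Sum slippage over (amount_src, quoted_rate, applied_rate) tuples."""
    return sum((fx_slippage(*t) for t in txns), Decimal("0"))


# Illustrative numbers only: BRL 500 quoted at 0.2000 USD/BRL,
# but the PSP applied 0.1990.
one_txn = (Decimal("500"), Decimal("0.2000"), Decimal("0.1990"))
print(fx_slippage(*one_txn))        # -0.50
print(daily_net([one_txn] * 1000))  # -500.00
```

The daily net is the figure to book into the treasury account; a persistently negative value is the signal to renegotiate the spread.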

Deposit-withdraw fraud loop

The canonical crypto-payments fraud: attacker deposits via card, immediately withdraws to crypto, then files a chargeback on the card payment. Funds gone, exchange eats the loss.

PM defenses:

  • Velocity rules: high-amount deposit + immediate crypto withdrawal = elevated risk score.
  • Withdrawal hold periods on card deposits (24-72h common).
  • 3DS-enforced on first card deposit (liability shift).
  • Deposit-method-aware withdrawal restrictions (deposit via card → withdraw to same card only, for X days).
  • Issuer/BIN-level risk scoring (some BINs run hot for friendly fraud).

Tradeoff: every defense hurts conversion for the 99.x% of legitimate users. The PM tunes the threshold.
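
A withdrawal hold on card deposits can be sketched as a single check at withdrawal time. The 48-hour value is one illustrative point inside the 24-72h range above, and the `Deposit` fields, destination strings, and reason codes are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple

# Illustrative hold; this is the threshold the PM tunes against conversion.
CARD_HOLD = timedelta(hours=48)


@dataclass(frozen=True)
class Deposit:
    method: str           # e.g. "card", "pix"
    amount: float
    settled_at: datetime


def check_withdrawal(deposit: Deposit, destination: str,
                     requested_at: datetime) -> Tuple[bool, str]:
    """Return (allowed, reason) for withdrawing funds from this deposit."""
    if deposit.method != "card":
        return True, "ok"  # this sketch only covers the card-deposit loop
    age = requested_at - deposit.settled_at
    if destination == "crypto" and age < CARD_HOLD:
        return False, "card_deposit_hold"
    return True, "ok"
```

Returning a reason code alongside the decision lets the UI explain the hold to the legitimate majority instead of showing a generic failure.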

Instant-rail recalls — PIX MED, UPI return

Instant rails are almost irreversible. The "almost" matters.

Rail | Recall mechanism | Window
PIX | MED (fraud return), MEC (operational error) | MED: 80 days to file; recipient bank has 7 days to hold
UPI | NPCI dispute mechanism via PSP | Varies; typically tight (days)
SEPA Instant | Recall request; recipient cooperation required | Short; recipient may decline
FPS | Confirmation of Payee + APP fraud rules (UK) | Reimbursement regime mandates response within set windows

PM implication: a "settled" txn isn't truly final. Build a hold-window UX for inbound instant-rail receipts that match high-risk patterns. Communicate clearly to the customer.

Idempotency & replay

Every payment API call must accept an idempotency key. If the same key arrives twice (because the client retried after a timeout), the system returns the original response, not a second charge. Without this, duplicate-charge failures happen at scale every day. PM rule: idempotency is a launch blocker, not a nice-to-have.

-- Sanity check: any payment intents with multiple successful auths?
SELECT
  payment_intent_id,
  COUNT(*) FILTER (WHERE status='approved') AS approved_count
FROM auth_attempts
WHERE created_at >= NOW() - INTERVAL '7 days'
GROUP BY payment_intent_id
HAVING COUNT(*) FILTER (WHERE status='approved') > 1;
-- Expectation: zero rows. Any row is a duplicate-charge incident.
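
On the write path, the idempotency rule can be sketched as a key-to-response cache in front of the charge call. This is an in-memory illustration only: `IdempotentProcessor` and its `charge` callable are hypothetical, and a real system would need a durable store with a uniqueness guarantee on the key rather than a process-local dict.

```python
import threading
from typing import Callable, Dict


class IdempotentProcessor:
    """Return the original response when the same key is seen again."""

    def __init__(self, charge: Callable[[dict], dict]) -> None:
        self._charge = charge
        self._responses: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def handle(self, idempotency_key: str, request: dict) -> dict:
        # The lock makes check-then-charge atomic within this process,
        # at the cost of serializing all charges (acceptable for a sketch).
        with self._lock:
            if idempotency_key in self._responses:
                return self._responses[idempotency_key]
            response = self._charge(request)
            self._responses[idempotency_key] = response
            return response


calls = []


def fake_charge(request: dict) -> dict:
    """Stand-in for a PSP call; records each real charge attempt."""
    calls.append(request)
    return {"status": "approved", "charge_id": len(calls)}


processor = IdempotentProcessor(fake_charge)
first = processor.handle("key-1", {"amount": 500})
retry = processor.handle("key-1", {"amount": 500})  # client retried after a timeout
print(first == retry, len(calls))  # True 1
```

The retry returns the stored response and the charge function runs once, which is exactly the property the SQL sanity check verifies after the fact.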

A general error-handling playbook

Walk this through in a system-design round when the interviewer says "what about failures?":

  1. Classify the failure: where in the chain, who owns it, transient vs persistent.
  2. Decide reversibility: can you undo? Or do you need to compensate?
  3. Detect early: SLO breach triggers automation, not a human.
  4. Failover with idempotency: retry safely; never double-charge.
  5. Communicate honestly: customer-facing copy, status page, internal pager.
  6. Reconcile: at end of day, match expectations to reality; surface deltas.
  7. Post-mortem: blameless; commit to one structural fix and one detection improvement.

Strong candidates land on this skeleton naturally. Memorize it.