Section B · Critical

Audit Trails for AI Systems

In compliance contexts, this is the most important non-AI part of the AI system. Get it right and you can ship; get it wrong and Legal blocks you.

What an audit trail actually has to do

A regulator (or internal auditor, or counsel) walks in tomorrow and says: "Show me how AI X handled case Y on date Z." You must be able to produce, within hours, a complete, tamper-evident, non-repudiable record of:

  1. What input was received (the alert, the customer data, the document).
  2. Who or what triggered the AI (user ID, system, schedule).
  3. What the AI was at that moment (model version, prompt version, tool definitions, retrieval index version).
  4. What the AI did (every model call, every tool call, every retrieval, every intermediate step).
  5. What it produced (the output, draft, recommendation).
  6. Who reviewed and approved/edited (human in the loop).
  7. What action was actually taken in the world.
  8. When each of the above happened (precise, synchronized timestamps).

The bar

Build for that bar from day one — bolt-on audit logging is always insufficient.

The four properties of a good audit trail

  1. Complete — every relevant event is recorded. No silent steps.
  2. Immutable / append-only — old records cannot be modified or quietly deleted.
  3. Attributable — every record has identity (user, system, agent run ID).
  4. Reproducible — given the inputs + recorded state, you can recreate (or at least explain) the output.

In regulated finance, immutability often translates to specific compliance requirements: WORM (Write Once Read Many) storage for certain records, retention periods (5+ years for AML), tamper-evident logging (e.g. hash chains).

Reproducibility — the hard one

LLMs are non-deterministic by default. Even with temperature=0, infrastructure-level non-determinism (batching, kernel selection) means same input → slightly different output. So strict bitwise reproducibility isn't achievable.

What you can reproduce:

  • The exact prompt that was sent.
  • The model version and parameters.
  • The tool definitions at that moment.
  • The retrieval results (if you log them).
  • The resulting output (recorded, not regenerated).

What you cannot reproduce:

  • A guarantee that re-running today gives the same answer.

What to commit to

Explainability over reproducibility. You can always show what was done and why. You cannot guarantee a re-run is identical. Most regulators accept this — they care that the record is complete and consistent, not that you can re-derive bit-for-bit.

What to log — the canonical event schema

Design the audit event schema before writing the agent. A reasonable skeleton:

audit event schema
{
  "event_id": "evt_01H...",
  "trace_id": "trace_01H...",
  "parent_event_id": "evt_01H...",
  "timestamp": "2026-05-08T13:42:18.234Z",
  "actor": {
    "type": "agent | user | system",
    "id": "agent_alert_triage_v3",
    "run_id": "run_01H..."
  },
  "event_type": "model_call | tool_call | tool_result | human_review | decision | external_action",
  "context": {
    "case_id": "case_12345",
    "alert_id": "alert_67890",
    "user_id": "customer_abc"
  },
  "snapshot": {
    "model": "claude-opus-4-7",
    "prompt_version": "alert_triage_prompt@v3.2",
    "tool_versions": {"lookup_kyc": "v1.4"},
    "retrieval_index_version": "regs_2026_05_07"
  },
  "input": { "/* full or hash-with-storage-reference */": null },
  "output": { "/* full or hash-with-storage-reference */": null },
  "metadata": {
    "input_tokens": 1842,
    "output_tokens": 412,
    "latency_ms": 1287,
    "cost_usd": 0.024
  },
  "outcome": "success | error | escalated",
  "error": null
}

Notes:

  • trace_id groups all events for one task. parent_event_id establishes causality.
  • The snapshot captures the versions that were active at the time. This is critical: you cannot reconstruct the state of alert_triage_prompt@v3.2 six months later unless you saved it then.
  • Input/output: full text in a content-addressed store; the event has a hash/reference to keep events small.
  • Cost lives in the audit record because compliance reports often cite resource use.
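
To make the schema concrete, here is a minimal Python sketch of an emitter that writes one event per step. The in-memory ContentStore and EventStore are stand-ins for whatever append-only backend you actually run; every name here is illustrative, not a real library API.

event emitter (sketch)
import hashlib
import json
import uuid
from datetime import datetime, timezone

class ContentStore:
    """Content-addressed payload store: key = SHA-256 of the JSON blob."""
    def __init__(self):
        self._blobs = {}

    def put(self, payload) -> str:
        blob = json.dumps(payload, sort_keys=True).encode("utf-8")
        key = "sha256:" + hashlib.sha256(blob).hexdigest()
        self._blobs[key] = blob
        return key

class EventStore:
    """Append-only event log: deliberately no update or delete method."""
    def __init__(self):
        self._events = []

    def append(self, event: dict):
        self._events.append(event)

def emit_event(events: EventStore, content: ContentStore, *, trace_id,
               parent_event_id, actor, event_type, context, snapshot,
               input_payload, output_payload, metadata=None,
               outcome="success", error=None) -> str:
    event = {
        "event_id": f"evt_{uuid.uuid4().hex}",
        "trace_id": trace_id,
        "parent_event_id": parent_event_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "event_type": event_type,
        "context": context,
        "snapshot": snapshot,
        # full text lives in the content store; the event keeps only the hash
        "input": {"ref": content.put(input_payload)},
        "output": {"ref": content.put(output_payload)},
        "metadata": metadata or {},
        "outcome": outcome,
        "error": error,
    }
    events.append(event)
    return event["event_id"]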

Versioning what the agent sees

For audit, treat these as first-class versioned artifacts:

  • Prompt templates: prompt registry; semver; tied to commits.
  • Tool schemas: tool descriptions and input schemas carry versions.
  • Retrieval indexes: snapshot ID, recorded with each retrieval.
  • Knowledge base documents: source URL + version + retrieved-at timestamp.
  • Models: pinned model identifier, never aliases.
  • Eval datasets: dataset versions; track which version each release was scored against.
  • Workflow definitions (n8n): workflow IDs + versions; export the JSON, store it in source control.

A regulator's question "what prompt was used on January 12?" must have a definitive answer.
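
A sketch of what a definitive answer looks like in code: a registry that resolves only exact name@version references and refuses aliases. The PromptRegistry here is hypothetical, written against no particular library.

prompt registry (sketch)
class PromptRegistry:
    """Resolves 'name@version' to the exact template text that shipped."""
    def __init__(self):
        self._templates = {}  # "alert_triage_prompt@v3.2" -> (text, commit)

    def register(self, name: str, version: str, text: str, commit: str):
        self._templates[f"{name}@{version}"] = (text, commit)

    def resolve(self, pinned: str) -> tuple[str, str]:
        if "@" not in pinned:
            # refuse unpinned references like "latest": audit needs exactness
            raise ValueError(f"unpinned prompt reference: {pinned!r}")
        return self._templates[pinned]

registry = PromptRegistry()
registry.register("alert_triage_prompt", "v3.2",
                  "You are an AML alert triage assistant...", commit="9f2c1ab")
text, commit = registry.resolve("alert_triage_prompt@v3.2")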

Storage architecture

Common pattern:

  • Hot path: events written to an append-only stream (Kafka, Postgres with an insert-only role, AWS QLDB, AWS CloudTrail-style infra).
  • Indexed search: events also indexed in a query store (OpenSearch, BigQuery, Postgres) for ad-hoc analysis.
  • Cold storage / WORM: long-term archive in object storage with object-lock (S3 Object Lock, GCS retention policies).
  • Content store: large payloads (full prompts, full outputs, retrieved docs) stored content-addressed (SHA-256-keyed); event records hold the hash.
  • Hash chains: each event includes the hash of the prior event, making tampering detectable. Lightweight tamper-evidence without a full blockchain.

Don't over-engineer

Postgres with a strict append-only role + S3 with object-lock + hash-chained events covers most regulated use cases.
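
A hash chain is a few lines of code, not a product. A sketch with illustrative names: each appended event commits to its predecessor's hash, and verification recomputes the whole chain.

hash-chained log (sketch)
import hashlib
import json

GENESIS = "sha256:" + "0" * 64

class HashChainedLog:
    def __init__(self):
        self._events = []
        self._prev_hash = GENESIS

    def append(self, event: dict) -> dict:
        event = {**event, "prev_hash": self._prev_hash}
        blob = json.dumps(event, sort_keys=True).encode("utf-8")
        event_hash = "sha256:" + hashlib.sha256(blob).hexdigest()
        stored = {**event, "hash": event_hash}
        self._events.append(stored)
        self._prev_hash = event_hash
        return stored

    def verify(self) -> bool:
        """Recompute the chain; any edited or dropped event breaks it."""
        prev = GENESIS
        for stored in self._events:
            body = {k: v for k, v in stored.items() if k != "hash"}
            if body["prev_hash"] != prev:
                return False
            blob = json.dumps(body, sort_keys=True).encode("utf-8")
            if stored["hash"] != "sha256:" + hashlib.sha256(blob).hexdigest():
                return False
            prev = stored["hash"]
        return True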

Privacy vs audit — the tension

Regulators want long retention; privacy law (GDPR) wants minimization and erasure. Reconciling:

  • Tokenization at the boundary: replace customer identifiers with internal tokens before sending to the model. The audit log refers to tokens; a separate, tightly-controlled mapping resolves tokens → identities. Right-to-erasure becomes "delete the mapping," and audit logs remain intact and useful for aggregate analysis without revealing identity.
  • Pseudonymization: similar — model sees customer_T0123 not "Jane Doe".
  • Field-level redaction: high-PII fields hashed/redacted at log time; full data in a separate restricted store.
  • Retention tiering: different events may have different retention (full prompts: 90 days; metadata + decisions: 7 years).
  • Legal review: data flows that cross borders or pass through third-party APIs need explicit approval.

This is also the right answer to "how do you handle PII?": tokenization at the boundary, not vibes-based redaction after the fact.
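
A minimal sketch of boundary tokenization, assuming a tightly-controlled vault service you run yourself (TokenVault and its methods are illustrative):

token vault (sketch)
import secrets

class TokenVault:
    """The only place tokens map back to identities; erasure deletes the mapping."""
    def __init__(self):
        self._token_by_identity = {}
        self._identity_by_token = {}

    def tokenize(self, identity: str) -> str:
        # stable token per identity, so the audit trail stays consistent
        if identity not in self._token_by_identity:
            token = "customer_T" + secrets.token_hex(4)
            self._token_by_identity[identity] = token
            self._identity_by_token[token] = identity
        return self._token_by_identity[identity]

    def resolve(self, token: str) -> str:
        return self._identity_by_token[token]  # access-restricted in practice

    def erase(self, identity: str):
        """Right-to-erasure: drop the mapping; audit events keep only the token."""
        token = self._token_by_identity.pop(identity, None)
        if token:
            self._identity_by_token.pop(token, None)

vault = TokenVault()
token = vault.tokenize("Jane Doe")  # prompts and audit events see only `token`
vault.erase("Jane Doe")             # logs stay intact, identity is gone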

Audit trails for human-in-the-loop

Don't forget the human part. Log:

  • Who reviewed (user ID, role).
  • When they reviewed (timestamp).
  • What they saw (the AI draft + supporting data hash).
  • What they did (approve, edit, reject, escalate).
  • What changes they made (diff between AI draft and final action).
  • Their stated reasoning (if captured).

The diff is gold

Over time, the AI-draft-to-final-action diff tells you where the AI is consistently wrong (burning reviewer time on a particular class of edit), and it feeds directly into eval datasets.
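
Capturing that diff is cheap with the standard library. A sketch (the SAR wording is an invented example, and wiring the result into the human_review event is assumed):

review diff (sketch)
import difflib

def review_diff(ai_draft: str, final_text: str) -> str:
    """Unified diff of what the human changed; empty string means approved as-is."""
    lines = difflib.unified_diff(
        ai_draft.splitlines(keepends=True),
        final_text.splitlines(keepends=True),
        fromfile="ai_draft", tofile="final",
    )
    return "".join(lines)

diff = review_diff(
    "Recommend filing SAR: structuring pattern across 3 accounts.\n",
    "Recommend filing SAR: structuring pattern across 4 accounts, "
    "confirmed against KYC records.\n",
)
# store `diff` on the human_review event; mine these diffs for eval cases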

Compliance-specific audit requirements

You won't need to recite chapter and verse, but signal awareness:

  • BSA / FinCEN: SAR filings have strict retention; the decision process for filing or not filing should be documented.
  • GDPR: data subject rights — access, rectification, erasure, portability. Audit log should support these requests.
  • MiFID II / market regulations: transaction-related decisions often have specific record-keeping rules.
  • EU AI Act: certain high-risk AI uses require documentation: data governance, human oversight, transparency, logging.
  • NYDFS Part 500 / similar cyber regs: incident logs for breaches.
  • Model risk management (SR 11-7 in US banking): model inventory, model validation records, ongoing monitoring.

Model governance / model risk management

A separate-but-adjacent discipline. The org must:

  • Maintain an inventory of every AI model and use in production.
  • Document each model's purpose, data, performance, limitations.
  • Validate models before production (pre-deploy evals, red-teaming).
  • Monitor in production (online evals, drift detection).
  • Re-validate on changes (new model version, new prompt, new data source).
  • Have an owner for each model.

Compliance likely has a Model Risk Management (MRM) function. As an architect, your role is to make their life easy: produce the validation documents, the eval reports, the monitoring dashboards. Build for review.

n8n's audit advantages

A talking point: n8n's native execution history records every workflow run with input/output at each node. For compliance, that's a giant head start — you don't have to build the trace, the platform produces it. You augment with content-addressing and pinning details (model version, prompt version) that n8n alone doesn't capture.

Tools / platforms worth name-dropping

  • OpenTelemetry GenAI conventions — emerging standard for LLM tracing.
  • Langfuse — open-source LLM observability with tracing/evals.
  • Phoenix (Arize) — open-source observability + evals.
  • Datadog LLM Observability / New Relic AI Monitoring — APM-vendor integrations.
  • Comet, MLflow, Weights & Biases — broader ML lifecycle tracking.
  • AWS CloudTrail / Azure Monitor — for the infra audit layer.
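
For OpenTelemetry specifically, a minimal sketch of a model-call span using GenAI semantic-convention attribute names. The conventions are still incubating, so verify the attribute names against the current spec; the app.trace_id attribute is a custom addition for correlating back to the audit trail.

OpenTelemetry span (sketch)
from opentelemetry import trace

tracer = trace.get_tracer("alert_triage_agent")

with tracer.start_as_current_span("model_call") as span:
    # attribute names per the (incubating) GenAI semantic conventions
    span.set_attribute("gen_ai.system", "anthropic")
    span.set_attribute("gen_ai.request.model", "claude-opus-4-7")
    span.set_attribute("gen_ai.usage.input_tokens", 1842)
    span.set_attribute("gen_ai.usage.output_tokens", 412)
    # custom attribute linking the span to the audit event stream
    span.set_attribute("app.trace_id", "trace_01H...")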

Talking-point: "how do you build audit trails for AI agents?"

~90 seconds out loud

"I'd start by treating the prompt, the tool schemas, the retrieval index, and the model version as first-class versioned artifacts — not as code constants. Every agent run gets a trace ID; every step (model call, tool call, human review) becomes an immutable event tied to that trace, with snapshots of which versions were active. Inputs and outputs go to content-addressed storage so events stay light. Side-effecting tools have idempotency keys recorded. PII gets tokenized at the boundary so audit logs are useful without leaking identity. The whole thing is append-only, retained per the AML requirement, indexable for queries, and reproducible enough to answer a regulator's 'why did the AI do this' within hours."

It says: I've thought about this compliance-first, not "let me bolt logging on later."