Designing audit trails for AI agents in banking

Three years ago, a Tier-1 bank we worked with received an explanation request from their national regulator. A small-business owner had been denied credit and filed a complaint. The bank's underwriting flow had recently been augmented with an LLM-based reasoning layer. The compliance officer asked us a deceptively simple question: "Can you tell me exactly why this application was rejected — every input, every rule, every model output, in the order it happened?"

It took the team a week to piece together. They got the answer, just barely. The regulator accepted it. But the lesson stuck: in regulated finance, your AI is only as good as the audit trail that proves how it decided.

This post lays out the architecture we now use on every banking AI build — from underwriting copilots like the ones in our BluCognition work to fraud-triage engines. It's opinionated, it's been audited, and it's fast enough to run in production hot paths.

The three classes of audit-trail failure

Most teams discover their audit trail is broken at the worst possible moment. The failure modes cluster into three patterns:

Incomplete capture — you logged the model's final output, but not the prompt that produced it, not the tool calls it made, and not the version of the policy that was active at inference time.
Non-deterministic replay — you have the inputs, but re-running them today produces a different output. The model version changed. The system prompt changed. The retrieved documents changed.
Untraceable lineage — you can show the AI's response, but not the human that approved it, the override that was applied, or the downstream action it triggered.

A good audit trail solves all three. A great one does so cheaply.

The shape of a defensible audit record

For every AI-mediated decision in a regulated workflow, we capture a single immutable record we call a decision envelope. It has five sections:

The Decision Envelope

Identity — request ID, customer ID, timestamp, system version
Inputs — raw inputs, normalised inputs, retrieved context, prompt template ID
Reasoning — model ID + version, system prompt hash, full message trace, tool calls, intermediate states
Output — final decision, structured rationale, confidence, policy version applied
Lineage — human reviewer ID (if any), override applied, downstream action ID
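
As a rough illustration, the five sections above could be modelled as frozen dataclasses. This is a minimal sketch, not our production schema: the class and field names are hypothetical, chosen to mirror the list, and the canonical JSON method is an assumption about how the record might be serialised before signing.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class Identity:
    request_id: str
    customer_id: str
    timestamp: str  # ISO 8601, UTC
    system_version: str


@dataclass(frozen=True)
class Inputs:
    raw: dict[str, Any]
    normalised: dict[str, Any]
    retrieved_context: list[str]  # IDs of the chunks actually retrieved
    prompt_template_id: str


@dataclass(frozen=True)
class Reasoning:
    model_id: str
    model_version: str
    system_prompt_hash: str
    message_trace: list[dict[str, Any]]
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    intermediate_states: list[dict[str, Any]] = field(default_factory=list)


@dataclass(frozen=True)
class Output:
    decision: str
    rationale: dict[str, Any]
    confidence: float
    policy_version: str


@dataclass(frozen=True)
class Lineage:
    reviewer_id: Optional[str] = None  # None when no human was involved
    override_applied: Optional[str] = None
    downstream_action_id: Optional[str] = None


@dataclass(frozen=True)
class DecisionEnvelope:
    identity: Identity
    inputs: Inputs
    reasoning: Reasoning
    output: Output
    lineage: Lineage

    def to_canonical_json(self) -> str:
        # Sorted keys and fixed separators give a stable byte representation,
        # which matters once the record is hashed or signed.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
```

Freezing the dataclasses only blocks attribute reassignment inside the process. The real immutability guarantee still has to come from the store the record is written to.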

The envelope is written once, signed, and stored in a write-once data store. We use S3 with Object Lock for that, but any WORM-capable store works. The signing key rotates quarterly. We keep envelopes for the longer of seven years or the customer's regulated retention window.
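
Signing can be sketched with the standard library alone. The example below uses HMAC-SHA256 as a stand-in for whatever signature scheme you actually run (a production system would more likely use an asymmetric key held in a KMS or HSM). The function names and the `key_id` field are assumptions for illustration. Recording a key identifier with each signature is one way to keep old envelopes verifiable after a key rotation.

```python
import hashlib
import hmac
import json


def canonical_bytes(envelope: dict) -> bytes:
    # Stable serialisation so the same envelope always yields the same bytes.
    return json.dumps(envelope, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_envelope(envelope: dict, key: bytes, key_id: str) -> dict:
    payload = canonical_bytes(envelope)
    return {
        "envelope": envelope,
        "sha256": hashlib.sha256(payload).hexdigest(),
        "key_id": key_id,  # which (possibly retired) key produced the signature
        "signature": hmac.new(key, payload, hashlib.sha256).hexdigest(),
    }


def verify_envelope(signed: dict, keys: dict) -> bool:
    key = keys.get(signed["key_id"])
    if key is None:
        return False  # unknown or discarded key: cannot verify
    payload = canonical_bytes(signed["envelope"])
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signed["signature"])


if __name__ == "__main__":
    keys = {"2024-Q1": b"example-key-not-for-production"}
    signed = sign_envelope({"request_id": "req-1", "decision": "decline"},
                           keys["2024-Q1"], "2024-Q1")
    print(verify_envelope(signed, keys))
```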

Make replay deterministic — or document why it isn't

The single biggest engineering investment is making decisions reproducible. Replay isn't just a nice-to-have. When a regulator asks "what would the system do today with these inputs," you need to answer with confidence.

To get there, we pin everything that influences a model's behaviour:

Model version (including the provider's underlying snapshot, not just a logical alias)
System prompts, stored as content-addressable artifacts and referenced by hash
Temperature and sampling parameters
RAG retrieval — the exact document IDs and chunks returned, frozen at the moment of the call
Tool definitions — schema, version, and the responses they actually returned

If you change any of these, the change is recorded and every new envelope references the new pin. Old envelopes still replay correctly because the artifacts they reference are immutable.
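
One way to picture content-addressable pinning is a store keyed by the SHA-256 of the artifact itself. The sketch below is an in-memory stand-in with hypothetical names (`ArtifactStore`, `PinSet`); a real deployment would back it with immutable object storage.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


class ArtifactStore:
    """In-memory stand-in for an immutable, content-addressed store."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, content: str) -> str:
        data = content.encode("utf-8")
        digest = hashlib.sha256(data).hexdigest()
        # Identical content maps to the same key, so a put can never replace
        # an existing artifact with different bytes.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> str:
        data = self._blobs[digest]
        if hashlib.sha256(data).hexdigest() != digest:
            raise ValueError("artifact does not match its hash")
        return data.decode("utf-8")


@dataclass(frozen=True)
class PinSet:
    """Everything an envelope needs in order to replay one model call."""

    model_snapshot: str
    system_prompt_hash: str
    temperature: float
    top_p: float
    retrieved_chunk_ids: tuple[str, ...]
    tool_schema_hashes: tuple[str, ...]


if __name__ == "__main__":
    store = ArtifactStore()
    v1 = store.put("You are an underwriting assistant.")
    v2 = store.put("You are an underwriting assistant. Cite the policy clause.")
    print(v1 != v2, store.get(v1))
```

Editing a prompt produces a new hash instead of overwriting the old artifact, which is why an envelope that references the old hash keeps replaying against the old text.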

For genuinely non-deterministic components — third-party APIs with rolling updates, for example — we document the boundary and store the actual response payload alongside the call. You can't replay through the API, but you can replay around it.
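
Replaying around a boundary can be sketched as a thin wrapper that records each live response and, in replay mode, returns the stored payload instead of calling out. `RecordedBoundary` and its record format are hypothetical. The sketch assumes the call arguments are JSON-serialisable and that calls are replayed in their original order.

```python
import json
from typing import Any, Callable, Optional


class RecordedBoundary:
    """Wraps a non-deterministic call: records live responses, replays stored ones."""

    def __init__(self, live_call: Callable[..., Any],
                 recorded: Optional[list] = None) -> None:
        self._live_call = live_call
        self._replaying = recorded is not None
        self.records = list(recorded or [])
        self._cursor = 0

    def __call__(self, **kwargs: Any) -> Any:
        request = json.dumps(kwargs, sort_keys=True)
        if self._replaying:
            if self._cursor >= len(self.records):
                raise ValueError("replay diverged: no recorded response left")
            entry = self.records[self._cursor]
            self._cursor += 1
            if entry["request"] != request:
                raise ValueError("replay diverged: request differs from the recorded call")
            return entry["response"]  # stored payload, the live API is never touched
        response = self._live_call(**kwargs)
        self.records.append({"request": request, "response": response})
        return response
```

In live mode, `records` is what gets stored in the envelope alongside the call. In replay mode, a mismatch between the replayed request and the recorded one raises an error, which surfaces drift instead of silently producing a different decision.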

Structured rationale beats free-text explanation

Asking an LLM to "explain its reasoning" in prose is a trap. The explanation looks plausible, but it's frequently a post-hoc reconstruction that doesn't match what actually drove the output. Regulators and your own compliance team can't act on prose.

Instead, force the model to emit a structured rationale alongside its decision. Something like:

{
  "decision": "decline",
  "primary_factors": [
    { "factor": "debt_to_income_ratio", "value": 0.62, "threshold": 0.45, "weight": "high" },
    { "factor": "trade_line_age_months", "value": 7, "threshold": 24, "weight": "medium" }
  ],
  "policy_version": "underwriting-v2.3.1",
  "model_confidence": 0.91
}

Now your audit log isn't "the model thought this was risky" — it's "the model declined because DTI was 0.62 against a 0.45 threshold under policy v2.3.1." That's an answer a regulator can verify and a customer can dispute.
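
A structured rationale is only useful if it is checked before it is logged. Here is a minimal sketch of that check, assuming the JSON shape shown above. The function names are hypothetical, and `mismatched_factors` assumes the factor names match keys in the normalised inputs stored in the envelope.

```python
import json

REQUIRED_KEYS = {"decision", "primary_factors", "policy_version", "model_confidence"}
FACTOR_KEYS = {"factor", "value", "threshold", "weight"}


def parse_rationale(raw: str) -> dict:
    """Reject model output that does not match the expected rationale shape."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"rationale missing keys: {sorted(missing)}")
    if not data["primary_factors"]:
        raise ValueError("rationale must name at least one factor")
    for factor in data["primary_factors"]:
        if FACTOR_KEYS - factor.keys():
            raise ValueError(f"incomplete factor: {factor}")
    if not 0.0 <= data["model_confidence"] <= 1.0:
        raise ValueError("model_confidence must be between 0 and 1")
    return data


def mismatched_factors(data: dict, normalised_inputs: dict) -> list:
    """Names of factors whose reported value differs from the recorded input."""
    return [
        f["factor"]
        for f in data["primary_factors"]
        if normalised_inputs.get(f["factor"]) != f["value"]
    ]


def summarise(data: dict) -> str:
    """One-line, verifiable statement of the decision for the audit log."""
    parts = [
        f"{f['factor']} was {f['value']} against a {f['threshold']} threshold"
        for f in data["primary_factors"]
    ]
    return f"{data['decision']} under {data['policy_version']}: " + "; ".join(parts)
```

Comparing the reported values against the recorded inputs is what keeps the structured rationale from becoming another post-hoc story: a factor the model cites with a value that never appeared in the inputs is flagged before the envelope is sealed.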

Human-in-the-loop is part of the audit trail, not a separate system

The moment a human reviewer is involved, your audit trail must capture that interaction with the same fidelity as the model's. We log:

Who saw the recommendation (name, role, timestamp)
What they saw — the exact UI state, including which fields were highlighted
What they did — approved, overrode, escalated, with reason code
What downstream system actioned the decision and when
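
The four items above can be captured as one review record tied to the envelope it concerns. This is a sketch under assumptions: the `ReviewEvent` fields and the `hash_ui_state` helper are hypothetical, and hashing a serialised description of the screen is just one possible way to pin down what the reviewer saw.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import dataclass

ALLOWED_ACTIONS = {"approved", "overrode", "escalated"}


def hash_ui_state(ui_state: dict) -> str:
    # Canonical JSON so the same screen content always yields the same hash.
    payload = json.dumps(ui_state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ReviewEvent:
    envelope_sha256: str    # ties the review to the exact record under review
    reviewer_id: str
    reviewer_role: str
    viewed_at: str          # ISO 8601, UTC
    ui_state_hash: str      # what was on screen, including highlighted fields
    action: str             # one of ALLOWED_ACTIONS
    reason_code: str
    downstream_system: str
    actioned_at: str        # ISO 8601, UTC

    def __post_init__(self) -> None:
        if self.action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown review action: {self.action!r}")
        if not self.reason_code:
            raise ValueError("every review action needs a reason code")


if __name__ == "__main__":
    ui = {"fields": ["dti", "trade_line_age"], "highlighted": ["dti"]}
    event = ReviewEvent(
        envelope_sha256="0" * 64,
        reviewer_id="u-17",
        reviewer_role="credit_officer",
        viewed_at="2024-01-01T09:00:00Z",
        ui_state_hash=hash_ui_state(ui),
        action="overrode",
        reason_code="R-12",
        downstream_system="loan-origination",
        actioned_at="2024-01-01T09:05:00Z",
    )
    print(event.action, event.ui_state_hash[:12])
```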

This matters because two-thirds of the regulatory complaints we've seen don't actually challenge the AI; they challenge whether a human could have caught the issue. If your trail doesn't show that the human had everything they needed to override, you lose that argument.

Performance: this doesn't have to be slow

A common objection: "Won't all this logging slow down inference?" In practice, no — if you do it right.

The decision envelope is constructed in-memory during the inference call, then written asynchronously to the WORM store. We use a local append-only log buffer with a few-second flush, plus a guaranteed-write path for the regulated-decision sub-system. The synchronous critical path adds well under 5 ms in our production deployments.
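
The split between a buffered path and a guaranteed path can be sketched with a queue and a background thread. This is a simplification, and the names (`EnvelopeWriter`, `sink`) are hypothetical. In particular, an in-memory queue loses its contents on a crash, so it stands in for the local append-only log described above; it does not replace it.

```python
import queue
import threading


class EnvelopeWriter:
    """Buffers envelopes off the hot path and flushes them in batches."""

    def __init__(self, sink, flush_interval: float = 2.0) -> None:
        self._sink = sink  # callable that durably stores a list of envelopes
        self._flush_interval = flush_interval
        self._queue = queue.Queue()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, envelope: dict) -> None:
        # Hot path: an in-memory enqueue, no I/O.
        self._queue.put(envelope)

    def submit_guaranteed(self, envelope: dict) -> None:
        # Regulated decisions: block until the sink has accepted the record.
        self._sink([envelope])

    def _drain(self) -> None:
        batch = []
        while True:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        if batch:
            self._sink(batch)

    def _run(self) -> None:
        # wait() returns False on timeout (time to flush) and True once stopped.
        while not self._stop.wait(self._flush_interval):
            self._drain()

    def close(self) -> None:
        self._stop.set()
        self._thread.join()
        self._drain()  # flush whatever arrived after the last timed flush


if __name__ == "__main__":
    stored = []
    writer = EnvelopeWriter(stored.extend, flush_interval=0.1)
    writer.submit({"request_id": "req-1"})
    writer.submit_guaranteed({"request_id": "req-2"})
    writer.close()
    print(len(stored))
```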

Storage cost is also a non-issue. A typical envelope is 8–40 KB compressed. Even at hundreds of millions of decisions per year, that works out to somewhere between single-digit and low-double-digit terabytes annually, which costs far less than the regulatory exposure of not having the trail.
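
The arithmetic is easy to check. The snippet below uses the 8–40 KB envelope range from the paragraph above, with decimal units. The two decision volumes are illustrative assumptions, not measurements.

```python
def annual_storage_tb(decisions_per_year: int, envelope_kb: float) -> float:
    # Decimal units: 1 TB = 1e9 KB.
    return decisions_per_year * envelope_kb / 1e9


if __name__ == "__main__":
    for decisions in (100_000_000, 300_000_000):
        low = annual_storage_tb(decisions, 8)
        high = annual_storage_tb(decisions, 40)
        print(f"{decisions:,} decisions/year: {low:.1f} to {high:.1f} TB")
```

At 100 million decisions a year this gives 0.8 to 4 TB, and at 300 million it gives 2.4 to 12 TB, before any replication or retention multiplier.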

What we'd tell a CTO starting from scratch

If you're standing up regulated AI today, three rules will save you a year of pain:

Design the audit trail before the model. Reverse-engineering it later means you'll always have gaps. Build the envelope first, then plug the inference into it.
Treat prompts as code. They get versioned, code-reviewed, and stored with content-addressable hashes. No editing system prompts in a Notion doc.
Pin everything pinnable. Model versions, system prompts, retrieval indexes, tool schemas. Mutability is the enemy of replay.

None of this is research-grade. It's engineering — careful, deliberate, and boring in the best way. The kind of work that means when the regulator calls, you don't lose a week piecing things together. You answer in an afternoon.
