All Insights
Applied AIFinTech·12 min read·May 2026

Designing audit trails for AI agents in banking

When the regulator asks why your agent denied a $250,000 loan application, "the model said so" is not an answer. This is the architecture we use to make every decision reconstructible — without slowing down inference.

OS
OlloSoft Engineering
Published May 12, 2026

Three years ago, a Tier-1 bank we worked with received an explanation request from their national regulator. A small-business owner had been denied credit and filed a complaint. The bank's underwriting flow had recently been augmented with an LLM-based reasoning layer. The compliance officer asked us a deceptively simple question: "Can you tell me exactly why this application was rejected — every input, every rule, every model output, in the order it happened?"

It took the team a week to piece together. They got the answer, just barely. The regulator accepted it. But the lesson stuck: in regulated finance, your AI is only as good as the audit trail that proves how it decided.

This post lays out the architecture we now use on every banking AI build — from underwriting copilots like the ones in our BluCognition work to fraud-triage engines. It's opinionated, it's been audited, and it's fast enough to run in production hot paths.

The three classes of audit-trail failure

Most teams discover their audit trail is broken at the worst possible moment. The failure modes cluster into three patterns:

  1. Incomplete capture — you logged the model's final output, but not the prompt that produced it, not the tool calls it made, and not the version of the policy that was active at inference time.
  2. Non-deterministic replay — you have the inputs, but re-running them today produces a different output. The model version changed. The system prompt changed. The retrieved documents changed.
  3. Untraceable lineage — you can show the AI's response, but not the human that approved it, the override that was applied, or the downstream action it triggered.

A good audit trail solves all three. A great one does so cheaply.

The shape of a defensible audit record

For every AI-mediated decision in a regulated workflow, we capture a single immutable record we call a decision envelope. It has five sections:

The Decision Envelope
  • Identity — request ID, customer ID, timestamp, system version
  • Inputs — raw inputs, normalised inputs, retrieved context, prompt template ID
  • Reasoning — model ID + version, system prompt hash, full message trace, tool calls, intermediate states
  • Output — final decision, structured rationale, confidence, policy version applied
  • Lineage — human reviewer ID (if any), override applied, downstream action ID

The envelope is written once, signed, and stored in a write-once data store. We use S3 with object-lock for that, but any WORM-capable store works. The signing key rotates quarterly. We keep envelopes for the longer of seven years or the customer's regulated retention window.

Make replay deterministic — or document why it isn't

The single biggest engineering investment is making decisions reproducible. Replay isn't just nice-to-have. When a regulator asks "what would the system do today with these inputs," you need to answer with confidence.

To get there, we pin everything that influences a model's behaviour:

If you change any of these, the envelope notes the change and the model version pin shifts. Old envelopes still replay correctly because the artifacts they reference are immutable.

For genuinely non-deterministic components — third-party APIs with rolling updates, for example — we document the boundary and store the actual response payload alongside the call. You can't replay through the API, but you can replay around it.

Structured rationale beats free-text explanation

Asking an LLM to "explain its reasoning" in prose is a trap. The explanation looks plausible, but it's frequently a post-hoc reconstruction that doesn't match what actually drove the output. Regulators and your own compliance team can't act on prose.

Instead, force the model to emit a structured rationale alongside its decision. Something like:

{
  "decision": "decline",
  "primary_factors": [
    { "factor": "debt_to_income_ratio", "value": 0.62, "threshold": 0.45, "weight": "high" },
    { "factor": "trade_line_age_months", "value": 7, "threshold": 24, "weight": "medium" }
  ],
  "policy_version": "underwriting-v2.3.1",
  "model_confidence": 0.91
}

Now your audit log isn't "the model thought this was risky" — it's "the model declined because DTI was 0.62 against a 0.45 threshold under policy v2.3.1." That's an answer a regulator can verify and a customer can dispute.

Human-in-the-loop is part of the audit trail, not a separate system

The moment a human reviewer is involved, your audit trail must capture that interaction with the same fidelity as the model's. We log:

This matters because two thirds of regulatory complaints we've seen don't actually challenge the AI — they challenge whether a human could have caught the issue. If your trail doesn't show the human had everything they needed to override, you lose that argument.

Performance: this doesn't have to be slow

A common objection: "Won't all this logging slow down inference?" In practice, no — if you do it right.

The decision envelope is constructed in-memory during the inference call, then written asynchronously to the WORM store. We use a local append-only log buffer with a few-second flush, plus a guaranteed-write path for the regulated-decision sub-system. The synchronous critical path adds well under 5ms in our production deployments.

Storage cost is also a non-issue. A typical envelope is 8–40 KB compressed. Even at hundreds of millions of decisions per year, you're talking single-digit terabytes annually — far less than the regulatory exposure of not having the trail.

What we'd tell a CTO starting from scratch

If you're standing up regulated AI today, three rules will save you a year of pain:

  1. Design the audit trail before the model. Reverse-engineering it later means you'll always have gaps. Build the envelope first, then plug the inference into it.
  2. Treat prompts as code. They get versioned, code-reviewed, and stored with content-addressable hashes. No editing system prompts in a Notion doc.
  3. Pin everything pinnable. Model versions, system prompts, retrieval indexes, tool schemas. Mutability is the enemy of replay.

None of this is research-grade. It's engineering — careful, deliberate, and boring in the best way. The kind of work that means when the regulator calls, you don't lose a week piecing things together. You answer in an afternoon.

Building regulated AI?

We've shipped audit-ready agents to banks, pharma and insurance.

Book a 30-minute discovery call — we'll pressure-test your architecture, no slides.

Book a Discovery Call

Continue reading