Six Pillars of Trustworthy Financial AI
Financial AI earns trust only when its reasoning is constrained, inspectable, and replayable. Outside that boundary, it isn’t really a system – it’s uncontrolled behaviour.
Simon Gregory | CTO & Co-Founder
Pillar 1: Auditability
When you can’t see how an answer was formed, you can’t trust it
Pillar 2: Authority
When AI can’t tell who is allowed to speak, relevance replaces legitimacy
Pillar 3: Provenance
When you can’t see the lineage, the system invents it
Pillar 4: Context Integrity
When the evidential world breaks, the model hallucinates the missing structure
Pillar 5: Temporal Integrity
When time collapses, financial reasoning collapses with it
Pillar 6: Determinism
When behaviour is unstable, trust must come from the architecture, not the model
Pillar 6: Determinism
When behaviour is unstable, trust must come from the architecture, not the model
You run the same query twice. You get different answers.
Not slightly different. Materially different. Different sources cited. Different conclusions reached. Different confidence implied. No error message. No audit trail. Just… different.
You run it a third time. A fourth. Each time, the system sounds certain. Each time, it sounds slightly different from the last.
This isn’t a bug or a configuration issue. This is how these systems work.
It is called non-determinism. Understanding it is the architectural foundation beneath everything else in this series.
The question isn’t whether to use AI in finance, but whether you have the architecture to use it safely.
Two Worlds
Deterministic systems always produce the same output for the same input. They are stable, repeatable, testable, and auditable. Financial infrastructure depends on this property at every layer where trust and auditability are required: pricing engines, ledgers, reconciliation systems, regulatory reporting, risk models. The trust they carry comes from the fact that their behaviour does not vary.
Non-deterministic systems can produce multiple valid outputs from the same input. They are probabilistic, context-sensitive, high-dimensional, parallel, and non-linear. Their potential comes from the fact that their behaviour can vary.
LLMs are non-deterministic by design. Vector embedding models are deterministic in principle, but in practice can exhibit non-deterministic behaviour (through hardware parallelism and versioning), meaning information retrieval itself can be a source of silent variation.
These are two fundamentally different worlds. Financial AI sits directly at their boundary, where non-deterministic reasoning must feed deterministic infrastructure.
The goal is not to eliminate non-determinism from every layer, but to ensure it never operates without deterministic controls where decisions are consequential.
Why LLMs are non-deterministic
The architecture is designed for variability.
The model doesn’t retrieve a correct answer. It constructs a probable one from thousands of possibilities. Vectors encode meaning as positions in high-dimensional space rather than as rules or lookups. The same question, surrounded by different context, activates different parts of the model. Even with fixed random seeds, micro-differences in context, timing, retrieval order, or hardware can compound into qualitatively divergent outputs.
The system explores a possibility space, rather than executing a fixed path.
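The variability described above enters at the sampling step. The sketch below is a toy illustration, not any production model's sampler: it shows how a temperature-scaled softmax over hypothetical logits yields one fixed token when decoding greedily, but several different tokens across runs when sampling. All numbers are invented for illustration.

```python
import math
import random

def softmax(logits, temperature):
    """Convert logits to probabilities; temperature rescales the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def next_token(logits, temperature, rng):
    """Greedy (deterministic) at temperature 0, sampled otherwise."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical logits over four candidate continuations.
logits = [2.0, 1.8, 0.5, 0.1]

greedy = {next_token(logits, 0, random.Random(i)) for i in range(100)}
sampled = {next_token(logits, 1.0, random.Random(i)) for i in range(100)}

print(greedy)   # always the single most probable token
print(sampled)  # several distinct tokens across runs
```

Note that even greedy decoding only removes variability at this one layer; upstream retrieval order, context assembly, and hardware parallelism remain sources of divergence, which is why the article treats determinism as an architectural property rather than a sampling setting.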
The butterfly effect in financial AI
LLMs behave like chaotic systems: tiny changes produce massive differences.
Small changes in prompt phrasing can produce large shifts in tone, conclusion, or emphasis. Different retrieval results change which evidence the model reasons from. A different model version introduces silent changes in model behaviour: the same query, the same system, different outputs, with no change made by any user. A different sampling temperature changes how the model weighs competing continuations.
In practice: a risk analyst reruns a research summary on the same earnings call. The first run flags a covenant concern. The second does not. Both outputs are fluent. Neither is flagged as uncertain. The analyst does not know which to trust, or whether to trust either.
This behaviour isn’t an edge case; it’s what these systems do.
Why non-determinism breaks financial workflows
Financial workflows are built on five assumptions that non-deterministic systems violate:
Stability – the same inputs produce the same outputs. LLMs cannot guarantee this.
Traceability – outputs can be traced to specific, verifiable sources with their full lineage intact. Generative synthesis obscures this.
Repeatability – processes can be re-run for audit, validation, or review. Non-deterministic processes cannot be replayed.
Contextual consistency – the same question, same evidence, should give the same answer. But LLMs change their answer based on what else is in the prompt.
Temporal consistency – the same question today should match the same question in three months. But model updates, retrieval inconsistency, embedding variations, and the model’s tendency to flatten time mean behaviour changes with no notification and no audit event.
A non-deterministic output may not be the same next time. It may not follow the same reasoning path. It may not be grounded in the same sources. It may change silently with model updates. It may change with context variations in ways no one intended.
These failures compound. The output, the reasoning, and the sources may change. Silently.
This means raw LLM output should be considered untrusted input for any consequential downstream process.
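Treating raw output as untrusted implies, at minimum, that variation must be detectable rather than silent. A minimal sketch, assuming outputs can be rendered into a canonical JSON form (the `canonical_hash` helper and the run payloads are illustrative, not a real system's schema): hashing each run makes divergence between "identical" queries an explicit event instead of an invisible one.

```python
import hashlib
import json

def canonical_hash(output: dict) -> str:
    """Hash a canonical JSON rendering so any material change is detectable."""
    blob = json.dumps(output, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

# Two hypothetical runs of the "same" query (values invented for illustration).
run_1 = {"conclusion": "covenant concern flagged", "sources": ["doc-17", "doc-42"]}
run_2 = {"conclusion": "no covenant concern", "sources": ["doc-42"]}

if canonical_hash(run_1) != canonical_hash(run_2):
    # Divergence is surfaced explicitly instead of flowing silently downstream.
    print("DIVERGENT: outputs differ; route to review, do not auto-consume")
```

The design point is that the comparison itself is deterministic: the same payload always hashes to the same value, so the check can sit inside the trust boundary even though the outputs it inspects cannot.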
The replayability problem
In regulated finance, the ability to reconstruct a decision is not optional. Regulators, auditors, and risk committees require it. If a firm made a consequential decision six months ago (such as a credit assessment, a risk classification, or a research-driven trade) and is asked to show its reasoning, it must be able to replay the process that produced it.
By design, non-deterministic systems cannot satisfy this requirement on their own.
The process that produced a given output does not persist. The sampling state, the retrieval results, the context window, the model weights at that moment: none of these is preserved or reproducible. Running the same query again does not reconstruct the original reasoning. It produces a new and different output that may reach different conclusions from different sources via a different path.
There is no audit trail because there is no fixed path to audit. There is no replay because the process was never deterministic in the first place.
This is not a logging problem, and capturing outputs after the fact doesn’t solve it. An output without a reproducible process is evidence of nothing. A regulator asking how a decision was reached cannot be satisfied by showing them the answer. They need to see the reasoning, the sources, the lineage, and the logic that connected them. Non-determinism makes that structurally impossible.
Human review can’t fix this. You can’t validate what you didn’t see. Even if you were there, you can’t guarantee the next run will match. The system offers no such guarantee.
This is the compliance exposure that non-determinism creates, not at the edges, but in the core of every consequential process it touches.
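To make the replayability requirement concrete, here is a sketch of the state a decision record would have to pin down at the moment of generation. Every field name and value is illustrative (there is no standard schema for this); the point is that model version, prompt, retrieval results, and sampling state must all be captured as content-addressed facts, because none of them survives otherwise.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

@dataclass(frozen=True)
class DecisionRecord:
    """Illustrative: everything needed to reconstruct how an output was produced."""
    model_id: str        # exact model + version, never "latest"
    prompt_hash: str     # hash of the full rendered prompt
    retrieved_docs: tuple  # (doc_id, content_hash) pairs, in retrieval order
    sampling: dict       # temperature, seed, top_p -- the sampling state
    output_hash: str     # hash of the output actually consumed downstream
    timestamp: str

record = DecisionRecord(
    model_id="example-llm-2024-06-01",          # hypothetical identifier
    prompt_hash=sha256("...rendered prompt..."),
    retrieved_docs=(("doc-17", sha256("filing text")),),
    sampling={"temperature": 0.0, "seed": 1234, "top_p": 1.0},
    output_hash=sha256("...model output..."),
    timestamp="2024-06-01T09:30:00Z",
)
print(json.dumps(asdict(record), indent=2))
```

Even a record like this only makes replay possible, not guaranteed: if the hosted model version behind `model_id` is retired or silently updated, the captured state documents the decision without being able to re-execute it, which is exactly the exposure the article describes.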
The deterministic trust boundary
Probabilistic black boxes do not fail randomly. They fail in predictable ways, and those ways have names.
The first five pillars of this series each describe a distinct failure mode. Separately, each is serious. Together, they describe something larger: a system under a common pressure. The sixth pillar is that pressure. It is what makes the other five necessary, and what the deterministic trust boundary exists to contain.
Auditability is the requirement that every output can be traced, verified, and replayed. It is both a direct consequence of deploying a probabilistic black box (which cannot trace its own reasoning) and the ultimate proof that the architecture is working.
Authority is the requirement that institutional legitimacy, not semantic proximity, determines who shapes an answer. The system is structurally blind to this distinction. It must be enforced from outside.
Provenance is the continuity layer that preserves full lineage from source to output. Without it, nothing downstream can be trusted. It is what holds all other pillars together.
Context Integrity is the requirement that the evidential world presented to the model remains structurally intact. Non-deterministic retrieval can undermine it, and the model’s chaotic sensitivity magnifies any instabilities and gaps.
Temporal Integrity is the requirement that time is treated as a first-class dimension. Outdated evidence that crosses the trust boundary does not announce itself. It looks correct.
Non-determinism is the failure mode that makes replayability structurally impossible. There is no fixed path to reconstruct. The same query does not reproduce the same reasoning. Every output is unrepeatable by design.
These six failure modes are not independent; they arise from the same underlying condition: probabilistic black box systems that explore a possibility space rather than executing a fixed path. That is the source of their capability, and the reason their failures compound rather than sit in isolation.
The model’s chaotic sensitivity to context variation means small differences in what enters the system can produce large differences in what comes out, magnifying every failure mode it touches.
Containment is the only viable architecture
The answer to non-determinism isn’t to eliminate it, as that would eliminate the capability. The correct response is to contain it.
The non-deterministic layer must be bounded by deterministic controls at every point where trust is required. That means deterministic inputs, deterministic lineage, deterministic retrieval paths, and deterministic authority signals, so that the generative step operates on a stable, validated foundation rather than an unstable one.
The LLM synthesis layer (where much of the non-determinism lives) must sit outside the trust boundary. It should receive structured, validated inputs. Its outputs should be anchored to auditable lineage. What it cannot reach, it cannot cite.
The other five pillars are not independent problems to be solved separately. They are the deterministic containment architecture that must exist precisely because the generative layer cannot be trusted to provide them. Together, they are what makes non-determinism safe to use.
The right architecture doesn’t constrain what AI can do. It expands what is possible. Reasoning, synthesis, and abstraction all become available when the non-deterministic layer is properly bounded.
The containment architecture is not the price you pay for using AI safely. It is what makes AI genuinely useful in finance rather than merely impressive.
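One concrete form the boundary can take is a deterministic citation gate, sketched below under assumed names (`enforce_citation_boundary` and the `context` structure are hypothetical, not a real API): the generative layer may only cite documents that entered through the validated context, so "what it cannot reach, it cannot cite" becomes an enforced invariant rather than a hope.

```python
def enforce_citation_boundary(output_sources, validated_context):
    """Deterministic gate: every cited source must come from the
    validated context actually supplied to the model."""
    allowed = {doc_id for doc_id, _lineage in validated_context}
    unknown = [s for s in output_sources if s not in allowed]
    if unknown:
        raise ValueError(f"Output cites sources outside the trust boundary: {unknown}")
    return True

# Hypothetical validated context: (doc_id, lineage) pairs that passed upstream checks.
context = [("doc-17", "filing/2024-Q1"), ("doc-42", "transcript/2024-05-02")]

print(enforce_citation_boundary(["doc-17"], context))  # contained: passes
try:
    enforce_citation_boundary(["doc-99"], context)     # never ingested: rejected
except ValueError as e:
    print(e)
```

The gate itself is trivially simple, which is the point: it is stable, repeatable, and testable in exactly the way the generative layer is not, so it can live inside the trust boundary and police what crosses it.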
Non-determinism is a property, not a feature. In finance, it must be contained by deterministic architecture.