The audit log is the most underrated AI safety feature
If you can't replay what your agent did, you don't have a product — you have a wager.
Every team I’ve audited in the last year that’s running an LLM agent in production has some form of logging. Most of them log the model’s output. Some log the user’s input. Almost none of them log the full context window — the system prompt, the conversation history, the retrieved documents, the tool call chain, and the model parameters.
This means that when something goes wrong — and it will — they can tell me what the model said, but not why it said it. They have the verdict but not the trial transcript.
Why context logging matters
The model’s output is a function of its input and its generation parameters. When a model makes a bad decision (hallucinating a fact, following a prompt injection, using a tool it shouldn’t), the diagnosis almost always starts with the input, not the output. You need to be able to reconstruct, exactly, what the model saw at the moment it made the decision.
This means logging:
- The complete system prompt, including any version identifier
- The full conversation history for the session
- Every document retrieved by the RAG layer, including relevance scores
- Every tool call the model made, and the response it received
- The model identifier, temperature, and any other generation parameters
- Timestamps precise enough to reconstruct the order of operations
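To make the list above concrete, here is a minimal sketch of one log record per model call, written as Python dataclasses. Every name in it (`AuditRecord`, `RetrievedDoc`, `ToolCall`, `to_json_line`, and the individual fields) is an illustrative assumption rather than a standard schema, so treat it as a starting point to adapt to your own stack.

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict


@dataclass
class RetrievedDoc:
    """One document returned by the retrieval layer (hypothetical shape)."""
    doc_id: str
    text: str
    relevance: float


@dataclass
class ToolCall:
    """One tool invocation and the response handed back to the model."""
    name: str
    arguments: dict
    response: str
    started_ns: int  # wall-clock nanoseconds, used to order operations


@dataclass
class AuditRecord:
    """Everything the model saw for a single call, plus what it produced."""
    session_id: str
    system_prompt: str
    system_prompt_version: str
    messages: list      # full conversation history, oldest first
    retrieved: list     # list of RetrievedDoc
    tool_calls: list    # list of ToolCall
    model: str
    params: dict        # temperature and any other generation parameters
    output: str
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    logged_ns: int = field(default_factory=time.time_ns)

    def to_json_line(self) -> str:
        # asdict() recurses into the nested dataclasses inside the lists.
        return json.dumps(asdict(self), sort_keys=True, ensure_ascii=False)
```

Writing one JSON line per model call keeps the log append-only and easy to search. Note that wall-clock timestamps can jump if the system clock is adjusted, so a per-session sequence number may be a worthwhile addition.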
The cost objection
“That’s a lot of data.” Yes, it is. But you’re already paying for the tokens. Storing them is cheap by comparison, and the alternative — not knowing why your agent did what it did — is far more expensive when a customer, regulator, or journalist asks.
Compress it. Retain it for 90 days. Put it behind access controls. But log it. The first time you have an incident and can replay the full context, you’ll wonder how you ever operated without it.
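One way to sketch "compress it, retain it for 90 days" is a gzip-compressed JSON Lines file per day, plus a pruning pass. The file naming, the helper names (`write_record`, `read_records`, `prune`), and the use of file modification time as the retention clock are all assumptions made for illustration. Access controls are left to the filesystem or storage layer and are not shown.

```python
import gzip
import json
import time
from pathlib import Path
from typing import List, Optional

RETENTION_SECONDS = 90 * 24 * 60 * 60  # the 90-day window from the text


def write_record(log_dir: Path, record: dict, now: Optional[float] = None) -> Path:
    """Append one record to the gzip-compressed log file for the current UTC day."""
    now = time.time() if now is None else now
    day = time.strftime("%Y-%m-%d", time.gmtime(now))
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"audit-{day}.jsonl.gz"
    # Appending in "at" mode adds a new gzip member; readers see one stream.
    with gzip.open(path, "at", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return path


def read_records(path: Path) -> List[dict]:
    """Load every record from one daily file, in the order it was written."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def prune(log_dir: Path, now: Optional[float] = None) -> List[Path]:
    """Delete daily files whose last write is older than the retention window."""
    now = time.time() if now is None else now
    removed = []
    for path in log_dir.glob("audit-*.jsonl.gz"):
        if now - path.stat().st_mtime > RETENTION_SECONDS:
            path.unlink()
            removed.append(path)
    return removed
```

In practice you would likely run `prune` on a schedule and ship the daily files to object storage. The point of the sketch is only that compression and retention are a few dozen lines, not a project.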
The privacy objection
“We can’t log user conversations.” Maybe. It depends on your jurisdiction and your terms of service. But the answer is rarely “log nothing.” It’s usually “log what you need, redact what you must, and retain for the minimum period that makes forensic investigation possible.”
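As a rough illustration of "redact what you must", the sketch below masks email addresses and long card-like digit runs before a conversation is written to the log. The two regular expressions and the helper names (`redact`, `redact_messages`) are assumptions for this example. Simple patterns like these will miss plenty of personal data, so a real deployment would need a redaction policy agreed with legal and privacy reviewers.

```python
import re

# Deliberately simple, illustrative patterns. They are not a complete PII detector.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_LIKE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def redact(text: str) -> str:
    """Replace email addresses and card-like numbers with fixed placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = CARD_LIKE.sub("[NUMBER]", text)
    return text


def redact_messages(messages: list) -> list:
    """Return a redacted copy of the conversation, leaving the original untouched."""
    return [{**m, "content": redact(m["content"])} for m in messages]
```

Redacting a copy at write time means the model still sees the real conversation while the stored log holds only placeholders. The trade-off is that a replay from the log is no longer byte-for-byte identical to what the model saw.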
The audit log isn’t about surveillance. It’s about accountability. If your agent can make decisions that affect people’s money, data, or access, you need a record of how those decisions were made. This is table stakes in traditional software. It should be table stakes for AI.