I shipped an AI feature on Friday. By Monday it was a liability.

Two Fridays ago, I shipped a feature I was proud of. A small LLM-powered assistant that helped users draft security questionnaire responses based on their company’s prior answers. It was clever. It worked well in testing. Customers liked the demo.

By Monday morning, I had three problems:

The assistant had hallucinated a SOC 2 certification that the customer didn’t have — and the customer had sent that response to a prospect without checking.
A second customer’s confidential vendor assessment had leaked into another customer’s suggestions via a retrieval bug I’d missed.
The cost was 4x what I’d projected, because the assistant was being invoked on every keystroke rather than on submit.

None of these were novel failure modes. Every one of them was something I knew about in theory and ignored in practice because I was moving fast and the demo looked good.

The four guardrails

After two weeks of cleanup, apology emails, and architectural surgery, I now have four rules I won’t ship an LLM feature without:

1. Output confidence markers

Every piece of generated text gets a visual indicator of how much it should be trusted. Not a probability score — users don’t know what to do with “87% confident.” A simple traffic light: this was drawn from your own data, this was inferred, this is the model’s best guess. Let the human decide.

2. Retrieval isolation

Tenant data retrieval is not just a database query concern. It’s a prompt concern. If documents from Customer A can appear in Customer B’s context window, you have a data breach, even if the database query was correctly scoped. The retrieval layer needs its own access control, tested independently.

3. Cost circuit breakers

Set a per-user, per-hour cost ceiling on LLM API calls. When it trips, degrade gracefully — show cached results, disable the feature, surface a message. Don’t let a feedback loop between a chatty frontend and an expensive API turn your weekend into an incident.

4. A staging environment that lies to you less

My staging environment had ten test documents. Production had ten thousand. The retrieval behavior was completely different at scale — more noise, more ambiguity, more opportunities for the model to confuse two similar-looking documents. Test with production-scale data, or accept that you’re testing a different system.

The meta-lesson

The real failure wasn’t technical. It was cultural. I shipped on a Friday because I wanted the weekend to “see how it does.” That’s not a deployment strategy. That’s a hope.

LLM features are not features in the traditional sense. They’re systems that behave probabilistically, and their failure modes are not the same as the failure modes of deterministic code. They need their own release discipline — slower, more instrumented, with more human checkpoints, not fewer.

I still think the feature is a good idea. I just won’t ship it on a Friday again.