How to make AI agents reliable
It’s time to wake up from the fever dream of autonomous AI. For the past year, the enterprise software narrative has been dominated by a singular, intoxicating promise: agents. We were told that we were on the verge of deploying digital employees that could plan, reason, and execute complex workflows while we watched from the sidelines, a narrative that Lena Hall, Akamai senior director of developer relations, rightly pillories.
But if you look at what is actually shipping in the enterprise, the reality is starkly different.
Drew Breunig recently published a sober analysis that acts as a necessary corrective to the hype. Breunig synthesizes data from several reports, including the study “Measuring Agents in Production,” and reveals a simple, inconvenient truth: The biggest obstacle to an agent-driven future is agentic unreliability.
In other words, most enterprise agents aren’t failing because the models aren’t smart enough; they’re failing because they aren’t boring enough.
The reliability gap
Breunig’s core finding is that successful production agents are “simple and short.” They don’t autonomously navigate the internet to solve open-ended problems. Instead, 68% of them execute fewer than 10 steps before handing control back to a human or concluding the task.
This aligns perfectly with what I’ve been calling AI’s trust tax. In our rush to adopt generative AI, we forgot that while intelligence is becoming cheap, trust remains expensive. A developer might be impressed that an agent can solve a complex coding problem 80% of the time. But a CIO looks at that same agent and sees a system that introduces a 20% risk of hallucination, data leakage, or security vulnerability into a production environment.
That 20% gap is the reliability gap. As Breunig notes, rational employees don’t adopt unreliable tools. They route around them.
Easier said than done. After all, given how generative AI works, we’re trying to build deterministic software on top of probabilistic models. Large language models (LLMs), cool though they may be, are non-deterministic by nature. Chaining them together into autonomous loops amplifies that randomness. If you have a model that is 90% accurate per step and you ask it to perform a five-step chain of reasoning, your total system accuracy drops to roughly 59%.
That isn’t an enterprise application; it’s a coin toss—and that coin toss can cost you. Whereas a coding assistant can suggest a bad function, an agent can actually take a bad action.
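If you want to see how quickly reliability erodes, the arithmetic fits in a few lines. This is just a back-of-the-envelope sketch: it assumes each step succeeds independently with the same probability, which real agent chains only approximate.

```python
# Back-of-the-envelope: reliability of a chain of independent steps
# compounds multiplicatively. Assumes identical, independent step accuracy.
def chain_reliability(per_step_accuracy: float, steps: int) -> float:
    return per_step_accuracy ** steps

print(chain_reliability(0.90, 5))   # ~0.59 for a five-step chain at 90% per step
print(chain_reliability(0.90, 10))  # ~0.35, which is why "simple and short" agents win
```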
The solution, counterintuitively, is to be less ambitious.
Lower ambitions mean greater success
Breunig argues that the path forward is to “deliberately constrain agent autonomy.” This is exactly right. In other words, we need to stop trying to build “God-tier” agents that can do everything and start building “intern-tier” agents that do one thing perfectly.
This brings us back to the concept of the “golden path,” something I’ve been writing about repeatedly.
We don’t want platform engineering teams to become insurmountable obstacles that only know the word “no” (that’s what the legal team is for—kidding!). Platform teams should build paved roads (golden paths) that make the right way to build software also the easiest way. For agents, this means creating standardized, governed frameworks where the blast radius is contained by design. A golden path for an agent might look like this:
- Narrow scope: The agent is authorized to perform exactly one function (e.g., “reset password” or “summarize JIRA ticket”), not “manage IT support.”
- Read-only by default: The agent can read data to answer questions but requires explicit human approval to write to a database or call an external API. This is key to building AI agents the safe way.
- Structured output: We stop relying on vibes and start enforcing schemas. The agent shouldn’t just chat; it should return structured JSON that can be validated by code before it triggers any action (see the sketch after this list).
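To make that concrete, here is a minimal sketch of the structured-output plus approval-gate pattern, assuming Pydantic for schema validation. The TicketSummary fields and the console prompt are illustrative stand-ins, not a prescription.

```python
# Minimal sketch: enforce structured output and a human approval gate.
# The TicketSummary fields and the input() prompt are illustrative only.
from pydantic import BaseModel, ValidationError

class TicketSummary(BaseModel):
    ticket_id: str
    summary: str
    proposed_action: str  # e.g., "close" or "escalate"

def handle_agent_reply(model_reply: str) -> None:
    try:
        result = TicketSummary.model_validate_json(model_reply)
    except ValidationError as err:
        # Reject anything that doesn't match the schema before it can act.
        print(f"Rejected malformed agent output: {err}")
        return

    # Read-only by default: any write or external action needs explicit approval.
    if input(f"Apply '{result.proposed_action}' to {result.ticket_id}? [y/N] ") == "y":
        print("Approved; the action would be executed here.")
    else:
        print("Skipped; agent output logged for review.")
```

The point isn’t the specific library; it’s that nothing the model says becomes an action until it passes a schema check and a human (or policy) check.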
All good, but it’s not enough. We also need to rethink how we handle agent memory.
Memory is a database problem
Breunig highlights “context poisoning” as a major reliability killer, where an agent gets confused by its own history or irrelevant data. We tend to treat the context window like a magical, infinite scratchpad. It isn’t. It is a database of the agent’s current state. If you fill that database with garbage (unstructured logs, hallucinated prior turns, or unauthorized data), you get garbage out.
If you want reliable agents, you need to apply the same rigor to their memory that you apply to your transaction logs:
- Sanitization: Don’t just append every user interaction to the history. Clean it.
- Access control: Ensure the agent’s “memory” respects the same row-level security (RLS) policies as your application database. An agent shouldn’t “know” about Q4 financial projections just because it ingested a PDF that the user isn’t allowed to see.
- Ephemeral state: Don’t let agents remember forever. Long contexts increase the surface area for hallucinations. Wipe the slate clean often.
My Oracle colleague Richmond Alake calls this emerging discipline “memory engineering” and, as I’ve covered before, frames it as the successor to prompt or context engineering. You can’t just add more tokens to a context window to improve a prompt. Instead, you must create a “data-to-memory pipeline that intentionally transforms raw data into structured, durable memories: short term, long term, shared, and so on.”
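As a rough illustration of what that hygiene can look like in code (not Alake’s actual pipeline), the sketch below trims the context to a recent window, drops empty turns, and filters entries against the caller’s permissions. The Turn shape and the has_access check are hypothetical stand-ins for your real schema and row-level security.

```python
# Minimal sketch of context hygiene: sanitize, enforce access, keep state ephemeral.
# The Turn shape and has_access() are hypothetical stand-ins for real systems.
from dataclasses import dataclass

@dataclass
class Turn:
    role: str        # "user", "agent", or "tool"
    content: str
    acl_tag: str     # label used by access control, e.g. "public" or "finance"

def has_access(user_roles: set[str], acl_tag: str) -> bool:
    # Stand-in for real row-level security; replace with your policy engine.
    return acl_tag == "public" or acl_tag in user_roles

def build_context(history: list[Turn], user_roles: set[str], max_turns: int = 10) -> list[Turn]:
    # Sanitization: drop empty or whitespace-only turns instead of appending everything.
    cleaned = [t for t in history if t.content.strip()]
    # Access control: the agent only "remembers" what this user is allowed to see.
    allowed = [t for t in cleaned if has_access(user_roles, t.acl_tag)]
    # Ephemeral state: keep only the most recent turns to limit hallucination surface.
    return allowed[-max_turns:]
```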
The rebellion against robot drivel
Finally, we need to talk about the user. One reason Breunig cites for the failure of internal agent pilots is that employees simply don’t like using them. A big part of this is what I call the rebellion against robot drivel. When we try to replace human workflows with fully autonomous agents, we often end up with verbose, hedging, soulless text, and it’s increasingly obvious to the recipient that AI wrote it, not you. And if you can’t be bothered to write it, why should they bother to read it?
This is why keeping a human in the loop isn’t just a safety feature; it’s a quality feature. It’s also how you start to bootstrap trust. You start with suggestion mode, then graduate to partial automation only where you have measured reliability. Unsurprisingly, then, the most successful agents described in the reports Breunig cites are those that augment human work rather than replace it. They act as a copilot (to borrow the Microsoft nomenclature) that drafts the email, writes the SQL query, or summarizes the report, but then pauses and asks a human: “Does this look right?”
The reliability is high because the human is the final filter. The trust tax is low because the human remains accountable.
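One simple way to operationalize that graduation from suggestion mode to partial automation is to gate it on measured acceptance rates. The sketch below is illustrative; the thresholds are assumptions, not recommendations.

```python
# Minimal sketch: graduate from suggestion mode to partial automation only
# after measured reliability clears a bar. Thresholds are illustrative.
class AutomationGate:
    def __init__(self, min_samples: int = 200, min_accept_rate: float = 0.98):
        self.accepted = 0
        self.total = 0
        self.min_samples = min_samples
        self.min_accept_rate = min_accept_rate

    def record_review(self, human_accepted: bool) -> None:
        # Every suggestion starts life reviewed and accepted (or not) by a human.
        self.total += 1
        self.accepted += int(human_accepted)

    @property
    def may_auto_apply(self) -> bool:
        # Only automate once the acceptance rate is measured, not assumed.
        if self.total < self.min_samples:
            return False
        return self.accepted / self.total >= self.min_accept_rate
```

Until the gate opens, every agent output stays a suggestion; even after it opens, you keep logging and sample-reviewing so the measurement never goes stale.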
The boring revolution
We are leaving the phase of AI magical thinking and entering the phase of AI industrialization. The headlines about artificial general intelligence and superintelligence are fun, but they are distractions for the enterprise developer. AI is now all about inference: the application of models to specific, governed data.
As Breunig’s analysis confirms, the agents that will actually survive in the enterprise aren’t the ones that promise to do everything. They’re the ones that do a few things reliably, securely, and boringly. The cure, in short, is not to wait for GPT-6 as some AI panacea. The cure is boring engineering that constrains blast radius, governs state, measures reality, and earns trust, one small workflow at a time.
In the enterprise, “boring” is what scales.
Original Link: https://www.infoworld.com/article/4112542/how-to-make-ai-agents-reliable.html
Originally Posted: Mon, 05 Jan 2026 09:00:00 +0000