How to build an AI agent that actually works
Everyone is building “agents,” but there is wide disagreement about what that means. David Loker, VP of AI at CodeRabbit, which runs one of the most widely deployed agentic code review systems in production, has a practical definition that cuts through the hype: The company’s code review “happens in a workflow” with “two agentic loops” embedded at specific points where reasoning is actually needed. Not an autonomous AI roaming free. For CodeRabbit, the agent is a workflow with intelligence inserted where it counts.
That distinction—agents embedded in workflows, not agents as autonomous beings—turns out to be the difference between a demo and a production system. Here’s how to build the production version, grounded in CodeRabbit’s experience and backed by peer-reviewed research.
Although my central example here is a code review agent, the same eight basic principles discussed (and the 10-point checklist) apply to building any kind of agent.
Start with the workflow, not the model
Loker describes CodeRabbit’s architecture as “a workflow with models chosen at various stages… with agentic loops using other model choices.” The system doesn’t start with a large language model (LLM) and hope. It runs a deterministic pipeline that fetches the diff, builds the code graph, runs static analysis, identifies changed files, determines review scope, and then inserts agentic steps where judgment is actually needed.
“There are some things that we know are very important so we run them anyway,” Loker says. “The code graph analysis, import graph analysis, having this static analysis tool information there, the diff, and some of the file-level information.” This base context gets assembled deterministically before any reasoning model is invoked.
Research confirms this hybrid approach. The Agentic Design Patterns framework identifies five subsystems every agent needs: Perception and Grounding, Reasoning and World Model, Action Execution, Learning and Adaptation, and Inter-Agent Communication. ReAct (Reason + Act), the popular pattern where an LLM interleaves chain-of-thought reasoning with tool calls in a single loop, skips most of these subsystems, which is why it’s fragile. Separately, hybrid architectures that combine structured workflow with embedded agentic loops achieve an 88.8% average Goal Completion Rate across five domains, outperforming pure ReAct, chain-of-thought, and tool-only agents on most metrics, including ROI (see Towards Outcome-Oriented, Task-Agnostic Evaluation of AI Agents).
The takeaway is to map your domain process first. Identify which steps require judgment (agentic) and which are mechanical (deterministic). Build the workflow skeleton, then embed agents where they add value.
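As a minimal sketch of that hybrid shape (all step names and the `llm` stub here are hypothetical, not CodeRabbit’s actual pipeline), the deterministic steps run unconditionally and an agentic loop is embedded only where judgment is needed:

```python
# Sketch of a hybrid pipeline: deterministic steps run unconditionally,
# and an agentic loop is inserted only where judgment is needed.
# All step names and the `llm` stub are illustrative assumptions.

def fetch_diff(pr):            return {"diff": f"diff for {pr}"}
def build_code_graph(ctx):     return {**ctx, "graph": "import graph"}
def run_static_analysis(ctx):  return {**ctx, "lint": ["unused-var"]}

def agentic_review(ctx, llm, max_turns=3):
    """Embedded agentic loop: the model reasons over pre-assembled context."""
    for _ in range(max_turns):
        step = llm(ctx)
        if step["done"]:
            return step["comments"]
        ctx = {**ctx, **step["extra_context"]}  # model asked for more context
    return []

def review_pipeline(pr, llm):
    ctx = fetch_diff(pr)             # deterministic
    ctx = build_code_graph(ctx)      # deterministic
    ctx = run_static_analysis(ctx)   # deterministic
    return agentic_review(ctx, llm)  # agentic, only here

# A stub "model" that finishes on the first turn:
comments = review_pipeline("PR#1", lambda ctx: {"done": True, "comments": ["LGTM"]})
```

The skeleton stays testable and debuggable because every step before the loop is ordinary code.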
Context engineering is the whole game
“Context engineering is the bread and butter” of what CodeRabbit does, Loker says. Not prompt engineering, but context engineering. The difference: Whereas prompt engineering is the process of crafting clever instructions, context engineering is the process of assembling the right information from the right sources, in the right structure, at the right time, for each step of the workflow.
CodeRabbit assembles context from the diff itself, full files, related files discovered via import graph, the code graph built from abstract syntax tree analysis, static analysis results, user-configured review instructions, learned patterns from past feedback, MCP (Model Context Protocol)-connected documentation, and web-fetched library docs. “There’s a massive exploration about how does this PR [pull request] connect up with all of the other aspects of the code,” Loker explains. “Which places could possibly have been impacted by your change and which parts of the codebase impact you.”
The level of detail is deliberately chosen per step. “The LLM is looking for something, it’s looking for a specific piece, and you can give that level of detail,” Loker explains. “Is it looking for just high-level summaries? Is it looking for snippets of code? Is it looking for actual line number code, detailed information? Do I need the whole function or do I just need the function signature? Sometimes, it might only be a function signature and maybe what the function is trying to accomplish that is enough information for us to understand whether or not you’re using it correctly.”
In a recent academic survey of context engineering for large language models, which considered retrieval and generation, processing, management, and system implementations including RAG (retrieval-augmented generation), memory, tools, and multi-agent coordination, the key finding was that LLMs with advanced context engineering are remarkably good at understanding complex contexts but limited at generating equally complex outputs. However, the inverse is also true: Models capable of generating complex outputs are not necessarily good at understanding complex contexts.
The Agentic Context Engineering (ACE) study proves that context should be treated as an evolving playbook, not a static prompt. The ACE system uses incremental delta updates organized as structured “bullets” with metadata that grow and refine over time, rather than monolithic prompt rewrites. Monolithic rewriting caused context to collapse from 18,282 tokens to 122, dropping accuracy from 66.7% to 57.1%. The system treating context as a living, structured document achieved +10.6% on agent benchmarks.
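A minimal sketch of the delta-update idea, with invented field names rather than the ACE paper’s actual schema:

```python
# Sketch of ACE-style incremental context updates: the playbook is a list of
# structured "bullets" with metadata, and updates are appends and counter
# edits (deltas), never a wholesale rewrite. Field names are assumptions.

from dataclasses import dataclass, field

@dataclass
class Bullet:
    text: str
    helpful: int = 0   # counters updated from downstream feedback
    harmful: int = 0

@dataclass
class Playbook:
    bullets: list = field(default_factory=list)

    def apply_delta(self, add=(), mark_helpful=(), mark_harmful=()):
        for text in add:
            self.bullets.append(Bullet(text))
        for i in mark_helpful:
            self.bullets[i].helpful += 1
        for i in mark_harmful:
            self.bullets[i].harmful += 1
        # Prune only bullets that are demonstrably harmful; never rewrite
        # the whole playbook, which is what caused the collapse in the study.
        self.bullets = [b for b in self.bullets if b.harmful <= b.helpful + 2]

pb = Playbook()
pb.apply_delta(add=["Check null handling in diff hunks"])
pb.apply_delta(mark_helpful=[0])
```

Because every update is a delta, the playbook can only grow and refine; there is no step where a single bad rewrite can destroy accumulated knowledge.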
But more context can make your agent worse
Here’s the trap Loker himself flags: “Context packing to that degree will end up in the situation where you’ll just forget. And it’ll only pay attention to some of it. Ultimately, even if it’s factually correct information, the LLM’s performance will degrade as you increase the context size.”
That’s not just an engineering intuition. Researchers at TII and Sapienza formalized this observation as The Distracting Effect, and the numbers are sobering. Not all irrelevant content is equally dangerous. They identified four types of distractors, from weakest to strongest: Related Topic (discusses something nearby but doesn’t contain the answer), Hypothetical (“In ancient Roman times…”), Negation (“It is a common misconception that…”), and Modal Statement (“The Pyramids may have been built via…”). That last category, hedged wrong answers, is the most dangerous because it mimics the style of authoritative text.
The counterintuitive finding is that better retrievers produce more dangerous distractors. The irrelevant results surfaced by stronger retrieval pipelines are more misleading than those from weaker ones. This makes RAG especially dangerous, because pulling semantically related but irrelevant information distracts the model more than nonsense. Adding a reranker makes it worse because related but irrelevant passages that survive reranking are the ones most likely to fool the LLM. Hard distracting passages reduce accuracy by as much as six to 11 points, even when the correct passage is also in the prompt.
This is why Loker emphasizes the selection step: “How do I then choose the information that’s appropriate? And that’s the part that’s like the actual context engineering because you can grab everything, but then you run out of space.”
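The selection step can be sketched as greedy packing under a token budget with a relevance floor; the scores, budget, and snippet names below are illustrative assumptions, not a real ranker:

```python
# Sketch of context selection: rank candidate snippets by task relevance and
# pack greedily under a token budget, dropping "related but irrelevant" items
# below a threshold rather than filling the window. Scores are assumed to
# come from an upstream ranker; the numbers here are invented.

def select_context(candidates, budget_tokens, min_score=0.5):
    chosen, used = [], 0
    for snippet in sorted(candidates, key=lambda s: s["score"], reverse=True):
        if snippet["score"] < min_score:
            break  # plausible-but-irrelevant text distracts more than it helps
        if used + snippet["tokens"] > budget_tokens:
            continue
        chosen.append(snippet)
        used += snippet["tokens"]
    return chosen

candidates = [
    {"id": "diff",        "score": 0.95, "tokens": 400},
    {"id": "caller_file", "score": 0.70, "tokens": 900},
    {"id": "old_rfc",     "score": 0.40, "tokens": 1200},  # filtered out
]
picked = select_context(candidates, budget_tokens=1500)
```

The relevance floor matters as much as the budget: given the distractor findings, leaving the window partly empty beats filling it with near-miss passages.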
And more skills can make it worse, too
The same principle applies to procedural knowledge. SkillsBench, a large-scale benchmark for testing “Agent Skills” (structured playbooks injected at inference time), found that human-curated, focused skills raise the pass rate by +16.2 percentage points (pp) on average. But there are traps.
Self-generated skills, where the model creates procedural knowledge before solving the task, provide no benefit on average (-1.3pp). GPT-5.2 actually degraded by -5.6pp. Today’s models cannot reliably author the procedural knowledge they benefit from consuming. This means that auto-generated playbooks need human curation, not just agent self-reflection.
Two or three focused skills are optimal (+18.6pp). Once you hit four or more, gains collapse to +5.9pp. Comprehensive documentation actually hurts performance by -2.9pp. And 16 of 84 tasks showed negative deltas, meaning the skills introduced conflicting guidance or unnecessary complexity for tasks the model already handled well.
One bright spot: A smaller model plus good skills can match a larger model without skills. Haiku 4.5 with skills (27.7%) outperformed Opus 4.5 without skills (22.0%). Investing in curated procedural knowledge is often a better use of budget than upgrading to a bigger model.
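A sketch of the implied discipline, capping injected skills at a few focused modules; the library entries, tags, and scores are invented for illustration:

```python
# Sketch: select at most a few focused, human-curated skills relevant to the
# task, consistent with the SkillsBench finding that gains collapse past two
# or three. The skill library and scores here are illustrative assumptions.

MAX_SKILLS = 3

def pick_skills(task_tags, skill_library):
    relevant = [s for s in skill_library if set(s["tags"]) & set(task_tags)]
    relevant.sort(key=lambda s: s["curation_score"], reverse=True)
    return relevant[:MAX_SKILLS]  # never inject the whole library

library = [
    {"name": "sql-migrations",   "tags": ["db"],           "curation_score": 0.9},
    {"name": "api-pagination",   "tags": ["http"],         "curation_score": 0.8},
    {"name": "everything-guide", "tags": ["db", "http"],   "curation_score": 0.2},
]
skills = pick_skills(["db"], library)
```

The `curation_score` stands in for human review of each skill; the benchmark’s warning is precisely that an agent’s self-assessed score is not a substitute.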
Use the right model for each job
“We’re using a combination of more than 10 different variants, depending on the area of the workflow that you’re in,” Loker says. Not because one model is bad, but because different steps need different capabilities.
“At every part of the workflow, depending on the level of requirement, depending on where that workflow sits in terms of difficulty, a model choice will be made,” Loker explains. “The details of how you call that model, the parameters, whether or not they’re using a lot of reasoning tokens or fewer reasoning tokens, the verbosity level… do I need to worry about things like latency? In which case I need to use some of these other models, especially for looping. For example, if I have a large latency and I’m looping, that sort of blows up.”
The cost dimension matters too. “We don’t pass on token costs to customers,” Loker says. “So ultimately we’re incentivized to find the models that are necessary, but ideally this [the chosen model] is necessary and sufficient for solving the problem. So we obviously bias towards quality, but we’re also testing out all the time what is the lowest tier that we can do and maintain that quality bar.”
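One way to sketch per-step routing, with placeholder model names and parameters (none of these are CodeRabbit’s actual choices):

```python
# Sketch of per-step model routing: each workflow stage maps to a model tier
# plus call parameters (reasoning effort, latency budget), with latency-
# sensitive looping stages pinned to fast models. All names are placeholders.

ROUTES = {
    "summarize_diff": {"model": "small-fast",   "reasoning": "low",  "max_latency_s": 2},
    "deep_review":    {"model": "large-reason", "reasoning": "high", "max_latency_s": 60},
    "verify_loop":    {"model": "mid-cheap",    "reasoning": "low",  "max_latency_s": 5},
}

def route(stage):
    cfg = ROUTES[stage]
    # Looping stages can't afford heavy reasoning: latency compounds per turn.
    if cfg["max_latency_s"] <= 5:
        assert cfg["reasoning"] == "low"
    return cfg

cfg = route("verify_loop")
```

Keeping the routing table in one place also makes model swaps an evaluation problem rather than a code archaeology problem.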
Build your tool pipeline deliberately
CodeRabbit doesn’t just “give the agent tools.” It runs a deliberate pipeline where each tool has a specific role. The base layer assembles the context (the diff, static analysis, and file-level information), and the agentic layer sits on top, where “it’s going to be able to read files and search for things and look at, for example, abstract syntax tree information to try and figure out where it’s connected,” Loker says.
Static analysis tools are used not to surface results directly, because those have a high rate of false positives, but to help the LLM understand where there might be issues. “So it can then reason about whether or not that’s actually an issue, given all of the other information that it has,” Loker says. The LLM becomes a reasoning layer on top of deterministic analysis, not a replacement for it.
Web queries fill knowledge gaps at runtime. “You might have a library that you use and the cutoff date of the LLM predates that library, or it predates the version of the library that you’re using,” Loker explains. “And so we might need to pull in documentation around what this function is, because we’ll otherwise come up with an error.”
Research formalizes tool use as four distinct stages, each requiring explicit engineering:
- Tool discovery: How does the agent know what tools exist?
- Tool selection: Given the task, which tool?
- Tool invocation: Calling the tool correctly with proper parameters and error handling.
- Result integration: Parsing output and injecting it back into reasoning.
Tool selection is the critical bottleneck because most agent failures occur there rather than during invocation.
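The four stages can be sketched as explicit code paths; the registry, tools, and keyword-based selector below are illustrative stand-ins for real discovery and selection logic:

```python
# Sketch of the four tool-use stages as explicit, separately engineered steps.
# The tools and the selection heuristic are illustrative assumptions; the
# point is that selection and integration are designed, not left to chance.

REGISTRY = {  # discovery: tools are declared up front, not found ad hoc
    "read_file":  {"desc": "Read a file from the repo",
                   "fn": lambda p: f"<contents of {p}>"},
    "search_ast": {"desc": "Find symbol references",
                   "fn": lambda s: [f"{s} used in main.py"]},
}

def select_tool(task):  # selection: the usual failure point
    return "search_ast" if "references" in task else "read_file"

def invoke(name, arg):  # invocation: wrap every call with error handling
    try:
        return {"ok": True, "result": REGISTRY[name]["fn"](arg)}
    except Exception as e:
        return {"ok": False, "error": str(e)}

def integrate(ctx, outcome):  # integration: fold results back into context
    return ctx + [outcome["result"]] if outcome["ok"] else ctx

ctx = integrate([], invoke(select_tool("find references to parse_config"),
                           "parse_config"))
```

Making each stage a named function also gives you a natural place to log and evaluate where tool-use failures actually happen.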
Memory needs active curation
“Developers chat in the PR back to CodeRabbit,” Loker explains. “And so CodeRabbit will look at that information and be like, oh, they didn’t like this comment. Or they’ll give us some information. Like in our organization, we don’t do things that way, we do it this way. And so we’ll take that information and over time we’re able to adjust.”
The storage is structured and retrieved by context, not appended to a log. Loker gives a concrete example: “You should really be using getters and setters… and you’re like, ‘We don’t care about that here.’ We’ll store that information and then later on, if we’re going to bring up a comment like that, that’s retrieved using RAG. So looking at the context of a future review… we’ll put that into the context window to say, do not [raise] comments related to getters and setters.”
This is per-organization customization at scale. “All these little things enrich the context, which allows us to do a PR in a very nuanced way and change it across organizations without having to build a new model, which obviously is not scalable for every single organization.”
The MemInsight paper demonstrates that autonomous memory augmentation—enriching stored interactions with semantic metadata, relationships, and context—yields +34% recall improvement over naive RAG baselines. Memory needs active curation, not just storage. MAIN-RAG shows that filtering retrieved context with multiple agents before passing it to the generator is as important as the retrieval itself. Don’t feed everything you retrieve to the LLM; use multi-agent consensus to decide what’s relevant.
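A toy sketch of curated memory along these lines, with a simple keyword matcher standing in for embedding-based RAG retrieval, and invented organization names:

```python
# Sketch of curated memory: each learning is stored as a structured record
# with metadata (org, topic) and retrieved by matching the current review's
# context, not replayed as a raw chat log. The keyword matcher stands in for
# embedding-based retrieval; all names here are invented.

memory = []

def remember(org, topic, instruction):
    memory.append({"org": org, "topic": topic, "instruction": instruction})

def recall(org, review_text):
    text = review_text.lower()
    return [m["instruction"] for m in memory
            if m["org"] == org and m["topic"] in text]

remember("acme", "getters", "Do not flag missing getters and setters.")
remember("acme", "logging", "Require structured logging in new handlers.")

# Only the relevant learning is injected into the next review's context:
hints = recall("acme", "This diff adds getters to the User class")
```

Scoping records per organization is what makes the customization scale without per-customer models: the same workflow, different retrieved context.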
Verify your own output
“We deal with this through our post review verification system,” Loker says. “Ultimately, it’s not even necessarily going to be the same model. Like if Sonnet 4.5 does the review, it doesn’t mean that it’s going to be doing the verification. So if it lies, someone’s going to catch it, basically.”
This cross-model verification is deliberate. There’s a benefit because the different models are trained differently. “What they care about, what they focus on, their distributions are different,” he says. “And so you’re going to get a blended experience, which, typically speaking, works out a little bit better.”
The verification goes beyond hallucination detection to checking whether claims are grounded: “This file… say there’s an error, the file doesn’t even exist. So let’s just throw this thing out.” There’s also a false-positive check and an agentic loop that can go back and double check its work by re-examining source files referenced in review comments. Loker calls this “de-noising.”
Cross-model verification, using a different model to check the output of the first, is a specific form of what the research literature calls model feedback. Model feedback is one of four feedback mechanisms alongside human feedback, environmental feedback (execution signals), and tool feedback (static analysis, tests). Environmental feedback is the cheapest and often most reliable. Human feedback is the highest quality but doesn’t scale.
The ACE framework formalizes this as the Reflector role, a component whose entire job is critiquing the Generator’s output. The ACE researchers’ ablation study shows that removing the Reflector significantly degrades performance. Critically, the Reflector must be separate from the Generator; self-reflection has blind spots that cross-component verification catches. The Agentic Design Patterns framework describes cross-component verification as two patterns working together: the Reflector (analyze outcomes for causality) and the Integrator (validate all information before it reaches the reasoning core). If your agent hallucinates, you’re missing an Integrator.
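A minimal sketch of such a verification pass, with a stubbed scoring function standing in for the second model and a cheap grounding check like the one Loker describes:

```python
# Sketch of post-review verification: a *different* model re-scores each
# comment, and a grounding check throws out comments referencing files that
# aren't in the diff. The verifier stub and threshold are assumptions.

def verify(comments, changed_files, verifier_model, threshold=0.5):
    kept = []
    for c in comments:
        if c["file"] not in changed_files:   # grounding: the file must exist
            continue
        if verifier_model(c) < threshold:    # cross-model critique
            continue
        kept.append(c)
    return kept

comments = [
    {"file": "app.py",   "text": "Possible None deref"},
    {"file": "ghost.py", "text": "Unused import"},  # hallucinated file
]
kept = verify(comments, changed_files={"app.py"},
              verifier_model=lambda c: 0.9)
```

The key design point is separation: `verifier_model` is a different callable from whatever generated `comments`, so the generator cannot grade its own work.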
Evaluation is a never-ending investment
Evals are increasingly seen as the starting point of agent development. Andrej Karpathy has said that Software 1.0 (handwritten programs) easily automates what you can specify, and Software 2.0 (AI-written programs) easily automates what you can verify. And both Greg Brockman and Mike Krieger have agreed that “evals are surprisingly often all you need” and that writing them is a core skill.
“It’s not something I can be passive about,” Loker says of model evaluation. “And ultimately your customers also are going to expect that you’re using the latest models, and to some degree you have to be willing to provably, at least to some degree, explain to them why this model might not be the model that you’re going to use.”
The evaluation problem is compounded by the fact that models keep changing underneath you. Loker compares it to forced library upgrades: “It’s almost like your library is being forcibly upgraded continuously, and the maintenance of code that goes along with that… Most software engineers would be like, no, don’t do that to me.” And it happens “every few months they’re being automatically, forcefully updated.”
CodeRabbit’s evaluation framework is multi-layered. First, offline metrics: “Is this model as good or directionally better than an existing model at finding issues, from a recall/precision perspective? Looking at the number of comments [the model] posted, how many were required before it found the same number of bugs?” Signal-to-noise ratio matters: “If it posted fewer comments but found the same number of issues, then we know that the signal-to-noise ratio has been improved.”
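That signal-to-noise comparison can be sketched as confirmed issues per comment posted; the numbers below are invented, not CodeRabbit’s data:

```python
# Sketch of the offline comparison: same labeled bug set, two candidate
# models; signal-to-noise is confirmed issues per comment posted.
# All counts are invented for illustration.

def eval_run(comments_posted, true_issues_found):
    return {
        "comments": comments_posted,
        "issues": true_issues_found,
        "signal_to_noise": true_issues_found / comments_posted,
    }

current   = eval_run(comments_posted=40, true_issues_found=10)
candidate = eval_run(comments_posted=25, true_issues_found=10)

# Same recall with fewer comments means improved signal-to-noise:
improved = (candidate["issues"] >= current["issues"]
            and candidate["signal_to_noise"] > current["signal_to_noise"])
```

Holding the bug set fixed is what makes the two runs comparable; the metric rewards fewer, better-targeted comments rather than raw volume.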
Next, qualitative review: “What do these comments look like, what’s the tone of them, how many of them are patches?” CodeRabbit even checks for hedging language. Loker notes that Sonnet 4.5, for example, will hedge and say something “might be an issue,” in which case his team will consider fixing it.
Then staged rollout: “We’ll branch out again and start rolling it more slowly. How are people perceiving it? And we’ll be watching. We’re watermarking to understand: Does this model achieve higher acceptance rates? Are people essentially abandoning the whole system as a result of this model?”
The GPT-5 launch was a case study in why this matters. “The expectations were pretty high because the recall rates were really good, and the various other metrics were really good. But ultimately, the latency was like, you know, that particular metric was kind of crazy.” CodeRabbit also found that even though GPT-5’s per-token pricing was cheaper than Sonnet 4.5’s, its actual spend ran higher because it uses far more thinking tokens. “So the cost-benefit doesn’t really come into play,” Loker says.
The Outcome-Oriented Evaluation of AI Agents framework proposes measuring agents on 11 dimensions, including Goal Completion Rate, Autonomy Index, Multi-Step Task Resilience, and ROI, not just latency and throughput. The researchers’ finding: No single architecture dominates all dimensions. You must profile your use case and measure what matters for your domain.
If you go multi-agent, topology matters
CodeRabbit’s architecture is essentially a coordinated multi-agent system, with different models handling different review stages and a workflow orchestrating their interaction. “There’s right now two agentic loops,” Loker says. “One is before the big review with the heavier reasoning model, and then another one comes out afterward.”
When building multi-agent systems, the coordination topology measurably affects performance. Graph topology (agents communicate freely) outperforms tree, chain, and star (central coordinator) topologies for complex reasoning tasks. Adding an explicit “Plan how to collaborate” step before agents start working improves milestone achievement by +3% (MultiAgentBench). Default to graph for complex tasks. Star is simpler but weaker.
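Topologies can be sketched as adjacency sets over agents (the agent names are placeholders): in a graph, any agent can message any other; in a star, everything routes through a coordinator.

```python
# Sketch of coordination topologies as adjacency sets. In a star, peripheral
# agents can only talk to the hub; in a graph, agents communicate freely.
# Agent names are placeholders.

AGENTS = ["planner", "reviewer", "verifier"]

def star(hub, agents):
    return {a: ({hub} if a != hub else set(agents) - {hub}) for a in agents}

def graph(agents):
    return {a: set(agents) - {a} for a in agents}

star_topo = star("planner", AGENTS)
graph_topo = graph(AGENTS)

# In a star, reviewer and verifier cannot exchange messages directly:
direct_in_star = "verifier" in star_topo["reviewer"]
```

That missing direct edge is why star topologies are simpler but weaker: every cross-agent exchange costs a round trip through (and a paraphrase by) the coordinator.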
A checklist for agent builders
Building an agent? Here’s the order of operations, grounded in production experience and peer-reviewed research.
- Figure out what you can evaluate. This is, at a high level, the business value assessment, and at a lower level, how fast or how often the workflow solves the problem. It will never be 100%. Plan to invest in evaluation forever, with continuous rollouts.
- Map the workflow. Identify deterministic steps vs. steps that need judgment. Don’t make the whole thing agentic.
- Engineer your context. Assemble the right information for each step, not everything but not too little. Use structured, itemized context, not narrative blobs. Filter aggressively: Irrelevant context actively degrades performance, and better retrievers surface more dangerous distractors.
- Curate procedural knowledge carefully. Human-written, focused skills help enormously. But don’t let agents write their own playbooks. Keep them to two or three focused modules, and remember that comprehensive documentation hurts more than it helps.
- Choose models per step. Different steps, different models. Smaller/faster where you can, heavier where you must.
- Build tools deliberately. Discovery, selection, invocation, integration. Each stage needs its own error handling.
- Build memory with curation. Don’t just log but augment, structure, and filter what gets stored and retrieved.
- Verify your own output. Separate generation from verification. Use a different model or approach to check the output.
- Design feedback loops. Environmental signals, user feedback, cross-model critique. Design them in from day one.
- If multi-agent, think topology. Graph beats tree beats chain beats star. Plan collaboration explicitly.
The bottom line
The bottom line might be that we shouldn’t all be imitating Claude Code or OpenClaw, building a linear agent that manages its own context and does “whatever.” Instead, we should be developing curated workflows with very specific tools. We should make sure the evaluations are well thought out for the overall workflow and are handled up front. In a month or three, the model and everything else will change, so evaluations are eternally useful.
Original Link:https://www.infoworld.com/article/4141524/how-to-build-an-ai-agent-that-actually-works.html
Originally Posted: Mon, 16 Mar 2026 09:00:00 +0000