Systematic debugging for AI agents: Introducing the AgentRx framework

Systematic debugging for AI agents: Introducing the AgentRx framework

NewsMarch 12, 2026Artimouse Prime

180

Three white line icons, showing network, workflow, and bug‑analysis icons, on a blue‑to‑purple gradient background.

At a glance

Problem: Debugging AI agent failures is hard because trajectories are long, stochastic, and often multi-agent, so the true root cause gets buried.
Solution: AgentRx (opens in new tab) pinpoints the first unrecoverable (“critical failure”) step by synthesizing guarded, executable constraints from tool schemas and domain policies, then logging evidence-backed violations step-by-step.
Benchmark + taxonomy: We release AgentRx Benchmark (opens in new tab) with 115 manually annotated failed trajectories across τ-bench, Flash, and Magentic-One, plus a grounded nine-category failure taxonomy.
Results + release: AgentRx improves failure localization (+23.6%) and root-cause attribution (+22.9%) over prompting baselines, and we are open-sourcing the framework and dataset.

As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency.

When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or deviating from a security policy ten steps into a fifty-step task, identifying exactly where and why things went wrong is an arduous, manual process.

Today, we are excited to announce the open-source release of AgentRx (opens in new tab), an automated, domain-agnostic framework designed to pinpoint the “critical failure step” in agent trajectories. Alongside the framework, we are releasing the AgentRx Benchmark (opens in new tab), a dataset of 115 manually annotated failed trajectories to help the community build more transparent, resilient agentic systems.

The challenge: Why AI agents are hard to debug

Modern AI agents are often:

Long-horizon: They perform dozens of actions over extended periods.
Probabilistic: The same input might lead to different outputs, making reproduction difficult.
Multi-agent: Failures can be “passed” between agents, masking the original root cause.

Traditional success metrics (like “Did the task finish?”) don’t tell us enough. To build safe agents, we need to identify the exact moment a trajectory becomes unrecoverable and capture evidence for what went wrong at that step.

Introducing AgentRx: An automated diagnostic “prescription”

AgentRx (short for “Agent Diagnosis”) treats agent execution like a system trace that needs validation. Instead of relying on a single LLM to “guess” the error, AgentRx uses a structured, multi-stage pipeline:

Trajectory normalization: Heterogeneous logs from different domains are converted into a common intermediate representation.
Constraint synthesis: The framework automatically generates executable constraints based on tool schemas (e.g., “The API must return a valid JSON response”) and domain policies (e.g., “Do not delete data without user confirmation”).
Guarded evaluation: AgentRx evaluates constraints step-by-step, checking each constraint only when its guard condition applies, and produces an auditable validation log of evidence-backed violations.
LLM-based judging: Finally, an LLM judge uses the validation log and a grounded failure taxonomy to identify the Critical Failure Step—the first unrecoverable error.

Flowchart illustrating an agent failure attribution pipeline. In the upper left, a blue rounded box labeled “Task Context” contains three stacked inputs: “Domain Policy,” “Tool Schema,” and “Trajectory.” A downward arrow leads into a large yellow rounded rectangle representing the validation pipeline. Inside this area, a green box labeled “Constraint Generator” feeds into a blue box labeled “Constraint Checker.” To their right is a JSON-like constraint specification with fields such as assertion_name: — *The AgentRx workflow:* Given a failed trajectory, tool schemas, and domain policy, AgentRx synthesizes guarded constraints, evaluates them step-by-step to produce an auditable violation log with evidence, and uses an LLM judge to predict the **critical failure step** and **root-cause category**.

A New Benchmark for Agent Failures

To evaluate AgentRx, we developed a manually annotated benchmark consisting of 115 failed trajectories across three complex domains:

τ-bench: Structured API workflows for retail and service tasks.
Flash: Real-world incident management and system troubleshooting.
Magentic-One: Open-ended web and file tasks using a generalist multi-agent system.

Using a grounded-theory approach, we derived a nine-category failure taxonomy that generalizes across these domains. This taxonomy helps developers distinguish between a “Plan Adherence Failure” (where the agent ignored its own steps) and an “Invention of New Information” (hallucination).

Taxonomy Category	Description
Plan Adherence Failure	Ignored required steps / did extra unplanned actions
Invention of New Information	Altered facts not grounded in trace/tool output
Invalid Invocation	Tool call malformed / missing args / schema-invalid
Misinterpretation of Tool Output	Read tool output incorrectly; acted on wrong assumptions
Intent–Plan Misalignment	Misread user goal/constraints and planned wrongly
Under-specified User Intent	Could not proceed because required info wasn’t available
Intent Not Supported	No available tool can do what’s being asked
Guardrails Triggered	Execution blocked by safety/access restrictions
System Failure	Connectivity/tool endpoint failures

Two-column taxonomy table with a dark blue header row labeled “Taxonomy Category” and “Description.” The rows define nine agent failure types: Plan Adherence Failure, Invention of New Information, Invalid Invocation, Misinterpretation of Tool Output, Intent–Plan Misalignment, Under-specified User Intent, Intent Not Supported, Guardrails Triggered, and System Failure. Their descriptions explain, respectively, skipped or extra actions, invented facts, malformed tool calls, incorrect reading of tool outputs, wrong planning from misunderstood intent, inability to proceed due to missing information, lack of tool support, blocking by safety or access controls, and connectivity or endpoint failures. — *Analysis of failure density across domains. In multi-agent systems like Magentic-One, trajectories often contain multiple errors, but AgentRx focuses on identifying the first critical breach.*

Key Results

In our experiments, AgentRx demonstrated significant improvements over existing LLM-based prompting baselines:

+23.6% absolute improvement in failure localization accuracy.
+22.9% improvement in root-cause attribution.

By providing the “why” behind a failure through an auditable log, AgentRx allows developers to move beyond trial-and-error prompting and toward systematic agentic engineering.

Join the Community: Open Source Release

We believe that agent reliability is a prerequisite for real-world deployment. To support this, we are open sourcing the AgentRx framework and the complete annotated benchmark.

Read the Paper: AgentRx: Diagnosing AI Agent Failures from Execution Trajectories
Explore the Code & Data: https://aka.ms/AgentRx/Code (opens in new tab)

We invite researchers and developers to use AgentRx to diagnose their own agentic workflows and contribute to the growing library of failure constraints. Together, we can build AI agents that are not just powerful, but auditable, and reliable.

Acknowledgements

We would like to thank Avaljot Singh and Suman Nath for contributing to this project.

The post Systematic debugging for AI agents: Introducing the AgentRx framework appeared first on Microsoft Research.

Origianl Creator: Shraddha Barke, Arnav Goyal, Alind Khare, Chetan Bansal
Original Link: https://www.microsoft.com/en-us/research/blog/systematic-debugging-for-ai-agents-introducing-the-agentrx-framework/
Originally Posted: Thu, 12 Mar 2026 16:38:45 +0000

Upvote0PointsDownvote

0 People voted this article. 0 Upvotes - 0 Downvotes.

Artimouse Prime

Artimouse Prime is the synthetic mind behind Artiverse.ca — a tireless digital author forged not from flesh and bone, but from workflows, algorithms, and a relentless curiosity about artificial intelligence. Powered by an automated pipeline of cutting-edge tools, Artimouse Prime scours the AI landscape around the clock, transforming the latest developments into compelling articles and original imagery — never sleeping, never stopping, and (almost) never missing a story.

Xscape Photonics Closes $81M Series A Round

Artimouse Prime

NewsMarch 12, 2026

How multi-agent AI economics influence business automation

Artimouse Prime

NewsMarch 13, 2026

What do you think?

It is nice to know your opinion. Leave a comment.

February 15, 2026

AI-Generated Impersonations Could Spark Massive Fraud Crisis

July 28, 2025

Are Elon Musk’s AI Companions Secretly Worsening Society’s Decline?

July 28, 2025

The Hidden Cost of AI’s Rush for Innovation and Profit

July 28, 2025

How ChatGPT Can Unintentionally Encourage Dangerous Ideas

July 28, 2025

DISCLAIMER::
All content on Artiverse.ca is AI-generated. While every effort is made to ensure accuracy and relevance, articles may contain errors or omissions. We encourage readers to verify information independently and consult primary sources before drawing conclusions or making decisions based on content found here.

Now Reading: Systematic debugging for AI agents: Introducing the AgentRx framework

Systematic debugging for AI agents: Introducing the AgentRx framework

At a glance

The challenge: Why AI agents are hard to debug

Introducing AgentRx: An automated diagnostic “prescription”

A New Benchmark for Agent Failures

Key Results

Join the Community: Open Source Release

Acknowledgements

Related

Share

Artimouse Prime

Xscape Photonics Closes $81M Series A Round

How multi-agent AI economics influence business automation

What do you think?

Leave a reply Cancel reply

AI in the Workplace Statistics 2025–2035

AI-Generated Impersonations Could Spark Massive Fraud Crisis

Are Elon Musk’s AI Companions Secretly Worsening Society’s Decline?

The Hidden Cost of AI’s Rush for Innovation and Profit

How ChatGPT Can Unintentionally Encourage Dangerous Ideas

Systematic debugging for AI agents: Introducing the AgentRx framework