Tell me when: Building agents that can wait, monitor, and act

Modern LLM agents can debug code, analyze spreadsheets, and book complex travel. Given those capabilities, it’s reasonable to assume they could handle something simpler: waiting. Ask an agent to monitor your email for a colleague’s response or watch for a price drop over several days, and it will fail. Not because it can’t check email or scrape prices; it can do both. It fails because it doesn’t know when to check. Agents either give up after a few attempts or burn through their context window, checking obsessively. Neither works.
This matters because monitoring tasks are everywhere. We track emails for specific information, watch news feeds for updates, and monitor prices for sales. Automating these tasks would save hours, but current agents aren’t built for patience.
To address this, we are introducing SentinelStep, a mechanism that enables agents to complete long-running monitoring tasks. The approach is simple. SentinelStep wraps the agent in a workflow with dynamic polling and careful context management. This enables the agent to monitor conditions for hours or days without getting sidetracked. We’ve implemented SentinelStep in Magentic-UI, our research prototype agentic system, to enable users to build agents for long-running tasks, whether they involve web browsing, coding, or external tools.
How it works
The core challenge is polling frequency. Poll too often, and tokens get wasted. Poll too infrequently, and the user’s notification gets delayed. SentinelStep makes an educated guess at the polling interval based on the task at hand—checking email gets different treatment than monitoring quarterly earnings—then dynamically adjusts based on observed behavior.
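As a rough illustration of this dynamic adjustment, the interval update could look like the following sketch. This is a hypothetical policy, not the actual SentinelStep heuristic; the bounds and multipliers are assumptions.

```python
def adjust_interval(interval_s: float, changed: bool,
                    min_s: float = 30.0, max_s: float = 3600.0) -> float:
    """Back off when nothing changed; tighten when activity is observed.

    Hypothetical policy: halve the interval after observed activity,
    grow it by 1.5x during quiet periods, and clamp to sane bounds.
    """
    if changed:
        interval_s /= 2.0   # activity observed: poll more often
    else:
        interval_s *= 1.5   # quiet: poll less often to save tokens
    return max(min_s, min(max_s, interval_s))
```

The initial value would come from the task itself: an inbox check might start at minutes, a quarterly-earnings watch at days.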
There’s a second challenge: context overflow. Monitoring tasks can run for days, so the accumulated checks will eventually exceed the model’s context window. SentinelStep handles this by saving the agent state after the first check, then restoring that state for each subsequent check.
Core components
As the name suggests, SentinelStep consists of individual steps taken as part of an agent’s broader workflow. As illustrated in Figure 1, there are three main components: the actions necessary to collect information, the condition that determines when the task is complete, and the polling interval that determines timing. Once these components are identified, the system’s behavior is simple: every [polling interval] do [actions] until [condition] is satisfied.
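That every/do/until behavior can be sketched as a simple loop. The function and parameter names here are illustrative placeholders, not Magentic-UI APIs.

```python
import time
from typing import Callable

def sentinel_loop(actions: Callable[[], object],
                  condition: Callable[[object], bool],
                  interval_s: float,
                  max_checks: int = 1000) -> object:
    """Every `interval_s` seconds, do `actions` until `condition` is satisfied."""
    for _ in range(max_checks):
        observation = actions()        # collect the needed information
        if condition(observation):     # is the monitoring task complete?
            return observation
        time.sleep(interval_s)         # wait until the next poll
    raise TimeoutError("condition not met within max_checks polls")
```

For example, passing a hypothetical check_inbox function as actions, a condition that looks for a colleague’s reply, and interval_s=600 would poll the inbox every ten minutes until the reply arrives.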

These three components are defined and exposed in the co-planning interface of Magentic-UI. Given a user prompt, Magentic-UI proposes a complete multi-step plan, including pre-filled parameters for any monitoring steps. Users can accept the plan or adjust as needed.
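To make that concrete, a monitoring step’s pre-filled parameters might be shaped roughly like this. The field names and values are hypothetical, for illustration only; they are not the actual Magentic-UI plan schema.

```python
# Hypothetical shape of a monitoring step proposed during co-planning.
monitoring_step = {
    "title": "Watch the repository's star count",
    "actions": [
        "open the GitHub repository page",
        "read the current star count",
    ],
    "condition": "star count >= 10000",
    "polling_interval_s": 3600,  # initial guess; adjusted dynamically at runtime
}
```

Exposing these fields in the plan is what lets users correct a bad guess, say, tightening the interval for a fast-moving flight search, before the run starts.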
Processing
Once a run starts, Magentic-UI assigns the most appropriate agent from a team of agents to perform each action. This team includes agents capable of web surfing, code execution, and calling arbitrary MCP servers.
When the workflow reaches a monitoring step, the flow is straightforward. The assigned agent collects the necessary information through the actions described in the plan. The Magentic-UI orchestrator then checks whether the condition is satisfied. If it is, the SentinelStep is complete, and the orchestrator moves to the next step. If not, the orchestrator determines the timestamp for the next check and resets the agent’s state to prevent context overflow.
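One way to picture the state reset is a snapshot-and-restore pattern: save the agent’s context after the first check, then start each later check from that snapshot so the context never grows across polls. This is a minimal sketch with a stand-in AgentState, not the Magentic-UI implementation.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class AgentState:
    messages: list = field(default_factory=list)

def run_check(state: AgentState, observation: str) -> AgentState:
    """One polling check: the agent reads the world and adds to its context."""
    state.messages.append(observation)
    return state

# Save the agent state after the first check...
snapshot = AgentState(messages=["plan", "first check"])

# ...and restore it before every subsequent check.
for obs in ["check 2", "check 3", "check 4"]:
    working = run_check(copy.deepcopy(snapshot), obs)
    # Context stays bounded: always the snapshot plus one new observation.
    assert len(working.messages) == 3
```

Without the restore step, each poll would append to the same history and the context would grow linearly with the number of checks.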
Evaluation
Evaluating monitoring tasks in real-world settings is nearly impossible. Consider a simple example: monitoring the Magentic-UI repository on GitHub until it reaches 10,000 stars (a measure of how many people have bookmarked it). That event occurs only once and can’t be repeated. Most real-world monitoring tasks share this limitation, making systematic benchmarking very challenging.
In response, we are developing SentinelBench, a suite of synthetic web environments for evaluating monitoring tasks. These environments make experiments repeatable. SentinelBench currently supports 28 configurable scenarios, each allowing the user to schedule exactly when a target event should occur. It includes setups like GitHub Watcher, which simulates a repository accumulating stars over time; Teams Monitor, which models incoming messages, some urgent; and Flight Monitor, which replicates evolving flight-availability dynamics.
Initial tests show clear benefits. As shown in Figure 2, success rates remain high for short tasks (30 sec and 1 min) regardless of whether SentinelStep is used. For longer tasks, SentinelStep markedly improves reliability: at 1 hour, task reliability rises from 5.6% without SentinelStep to 33.3% with it; and at 2 hours, it rises from 5.6% to 38.9%. These gains demonstrate that SentinelStep effectively addresses the challenge of maintaining performance over extended durations.

Impact and availability
SentinelStep is a first step toward practical, proactive, longer‑running agents. By embedding patience into plans, agents can responsibly monitor conditions and act when it matters—staying proactive without wasting resources. This lays the groundwork for always‑on assistants that stay efficient, respectful of limits, and aligned with user intent.
We’ve open-sourced SentinelStep as part of Magentic-UI, available on GitHub or via pip install magentic-ui. As with any new technique, production deployment should be preceded by testing and validation for the specific use case. For guidance on intended use, privacy considerations, and safety guidelines, see the Magentic-UI Transparency Note.
Our goal is to make it easier to implement agents that can handle long-running monitoring tasks and lay the groundwork for systems that anticipate, adapt, and evolve to meet real-world needs.
Original Creators: Hussein Mozannar, Matheus Kunzler Maldaner, Maya Murad, Jingya Chen, Gagan Bansal, Rafah Hosn, Adam Fourney
Original Link: https://www.microsoft.com/en-us/research/blog/tell-me-when-building-agents-that-can-wait-monitor-and-act/
Originally Posted: Tue, 21 Oct 2025 16:00:00 +0000