GroundedPlanBench Enhances Robot Planning with Spatial Awareness
Robot planners that rely on vision-language models often struggle with long, multi-step tasks. A central challenge is that natural-language instructions and plans can be ambiguous, especially about which action to take and where in the scene to take it. To tackle this, researchers developed GroundedPlanBench, a benchmark that tests whether models can both plan actions and identify where in real-world scenes those actions should happen.
Introducing Spatially Grounded Planning
Traditional systems usually split planning into two steps: a vision-language model first generates a plan in natural language, and a second model then translates that plan into actions. This pipeline is error-prone, especially on complex tasks, because the intermediate language plan can be vague or contain hallucinated details, and errors in the initial plan cascade into failures during execution.
The new approach asks whether a single model can decide both what to do and where to do it, improving overall planning accuracy and task success. The key idea is to ground each plan step in spatial information, tying it to a concrete location in the observed scene, so plans become more precise and less ambiguous.
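To make the contrast concrete, here is a minimal sketch of the difference between a language-only plan step and a spatially grounded one, written as Python dictionaries. The field names and coordinates are illustrative assumptions, not the paper's actual format.

```python
# Illustrative only: the field names below are hypothetical, not the benchmark's schema.

# A purely language-based plan step leaves the target location implicit:
ungrounded_step = {"instruction": "put the spoon on the white plate"}

# A spatially grounded plan step ties the action to a concrete location in the
# observed image, removing the ambiguity about where it should happen:
grounded_step = {
    "action": "place",
    "object": "spoon",
    "target_point": (412, 233),          # (x, y) pixel location in the scene image
    "target_box": (380, 200, 450, 270),  # optional bounding box (x1, y1, x2, y2)
}
```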
How GroundedPlanBench Works
GroundedPlanBench was built from 308 robot manipulation scenes taken from the DROID dataset. Experts reviewed each scene and defined tasks, writing instructions in two styles: explicit commands like “put a spoon on the white plate” and more general goals like “tidy up the table.” Each task was decomposed into steps drawn from four basic action types—grasp, place, open, and close—each tied to a specific location in the scene image.
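As a rough illustration of what a single benchmark entry might contain, the sketch below uses a hypothetical Python structure; the field names, coordinate conventions, and step format are assumed here and may differ from the released data.

```python
# Hypothetical layout of a single GroundedPlanBench task; real field names and
# coordinate conventions may differ from the released benchmark.
task_entry = {
    "scene_id": "droid_scene_0042",                 # one of the 308 DROID scenes
    "explicit_instruction": "put a spoon on the white plate",
    "general_instruction": "tidy up the table",
    "steps": [  # ordered plan, each step one of: grasp, place, open, close
        {"action": "grasp", "object": "spoon", "location": (512, 340)},
        {"action": "place", "object": "spoon", "location": (610, 295)},
    ],
}
```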
This setup lets the benchmark evaluate whether vision-language models can generate plans that specify both the actions to take and the locations where they should occur. The goal is to see whether models can combine language understanding with spatial reasoning to produce executable plans.
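One plausible way to score such plans is to check, step by step, that both the predicted action type and its location match the annotation. The tolerance-based metric below is a simplified sketch, not necessarily the benchmark's actual evaluation protocol.

```python
import math

def score_grounded_plan(predicted_steps, reference_steps, pixel_tolerance=50.0):
    """Toy metric: a step counts as correct only if its action type matches the
    annotation and its predicted location falls within a pixel tolerance of the
    annotated location. The benchmark's real metrics may be defined differently."""
    correct = 0
    for pred, ref in zip(predicted_steps, reference_steps):
        action_ok = pred["action"] == ref["action"]
        close_enough = math.dist(pred["location"], ref["location"]) <= pixel_tolerance
        if action_ok and close_enough:
            correct += 1
    return correct / max(len(reference_steps), 1)
```

With the hypothetical task_entry above, the reference would simply be task_entry["steps"], and a score of 1.0 would mean every step was both correctly typed and correctly located.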
Introducing Video-to-Spatially Grounded Planning
To help models learn this skill, researchers developed Video-to-Spatially Grounded Planning (V2GP). This framework converts robot demonstration videos into training data, teaching models how actions relate to specific locations in real-world scenes. By learning from videos, models can better understand the context and spatial details needed for successful planning.
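The sketch below captures the general video-to-data idea: pair an observation frame with the task instruction and the grounded step that follows it. The function and field names are invented for illustration; the real V2GP pipeline (frame selection, label extraction, output format) is more involved.

```python
def video_to_training_examples(frames, instruction, annotated_steps):
    """Rough sketch of turning a demonstration video into grounded-planning
    training data. Names and fields are illustrative, not the V2GP pipeline."""
    examples = []
    for step in annotated_steps:
        observation = frames[step["start_frame"]]   # frame seen before the action starts
        label = {
            "action": step["action"],               # e.g. "grasp" or "place"
            "location": step["gripper_pixel"],      # where the action happens in that frame
        }
        examples.append({"image": observation, "prompt": instruction, "label": label})
    return examples
```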
Tests with a range of open- and closed-source vision-language models showed that grounded planning for long, complex tasks remains challenging. However, training with V2GP improved both planning accuracy and spatial grounding, as validated on the benchmark and in real-world robot experiments.
This approach highlights the importance of integrating spatial reasoning into language-based planning systems, moving closer to more reliable and autonomous robot behavior in diverse environments.