OmniEAR Logo

OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks

Zixuan Wang1*, Dingming Li1*, Hongxing Li1, Shuo Chen1, Yuchen Yan1, Wenqi Zhang1, Yongliang Shen1†, Weiming Lu1, Jun Xiao1, Yueting Zhuang1
1Zhejiang University
Preprint. Under review.
*Equal Contribution, †Corresponding Author
OmniEAR Framework Overview

OmniEAR is a comprehensive framework for evaluating agent reasoning in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Its text-based environment representation models continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains, revealing fundamental gaps in current language models' embodied reasoning abilities.

Abstract

Large language models excel at abstract reasoning, but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains.

Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations.

These findings demonstrate that embodied reasoning poses challenges fundamentally different from those current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems.

📖 For detailed documentation, installation guides, and API reference, visit: omniembodied.readthedocs.io

Framework Overview

OmniEAR Data Generation Pipeline

OmniEAR employs a four-stage automated benchmark generation pipeline that combines large language models with rule-based validation to create diverse, physically consistent scenarios. The pipeline comprises (a) Scene Generation from an internet corpus using semantic seeds, (b) Task Generation with skill sampling across seven categories, (c) Evaluation Logic Extraction for automated assessment, and (d) Expert Trajectory Generation with human validation.
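As a rough illustration only, the four stages can be sketched in Python as below. The class and function names are hypothetical placeholders rather than the actual OmniEAR code, and the LLM-driven and human-validated steps are stubbed out.

```python
# Illustrative sketch of the four-stage generation pipeline; all names and
# signatures are assumptions, and the LLM / human-validation steps are stubs.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scene:
    objects: list = field(default_factory=list)    # objects with continuous physical attributes
    relations: list = field(default_factory=list)  # spatial relations between objects


@dataclass
class Scenario:
    scene: Scene
    instruction: str                         # natural-language task description
    checker: Callable[[dict], bool]          # automated success predicate
    expert_trajectory: list = field(default_factory=list)


def generate_scene(seed: str) -> Scene:
    """(a) Scene Generation: expand a semantic seed into a scene (LLM step, stubbed)."""
    return Scene(objects=[{"name": seed}], relations=[])


def generate_task(scene: Scene) -> str:
    """(b) Task Generation: sample a skill from the seven categories (stubbed)."""
    return f"Interact with {scene.objects[0]['name']}"


def extract_checker(instruction: str) -> Callable[[dict], bool]:
    """(c) Evaluation Logic Extraction: derive a rule-based success test (stubbed)."""
    return lambda final_state: final_state.get("done", False)


def generate_trajectory(scene: Scene, instruction: str) -> list:
    """(d) Expert Trajectory Generation: reference solution, later human-validated (stubbed)."""
    return ["<expert actions>"]


def build_scenario(seed: str) -> Scenario:
    """Run one seed through all four stages."""
    scene = generate_scene(seed)
    instruction = generate_task(scene)
    return Scenario(scene, instruction, extract_checker(instruction),
                    generate_trajectory(scene, instruction))
```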

Our EAR-Bench contains 1,500 scenarios with 64K objects and 6K attribute types, spanning diverse domains from household to industrial settings. The balanced task distribution covers single-agent tasks (Direct Command, Tool Use, Attribute Reasoning, Compound Reasoning) and multi-agent collaboration tasks (Explicit, Implicit, and Compound Collaboration), enabling systematic evaluation of embodied reasoning capabilities across increasing cognitive complexity levels.
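For orientation, the seven task categories and a toy scenario record might look like the following sketch. The enum mirrors the category names listed above, while the record layout is only an assumed illustration of the text-based environment representation, not the released data format.

```python
from enum import Enum


class TaskCategory(Enum):
    # Single-agent tasks
    DIRECT_COMMAND = "direct_command"
    TOOL_USE = "tool_use"
    ATTRIBUTE_REASONING = "attribute_reasoning"
    COMPOUND_REASONING = "compound_reasoning"
    # Multi-agent collaboration tasks
    EXPLICIT_COLLABORATION = "explicit_collaboration"
    IMPLICIT_COLLABORATION = "implicit_collaboration"
    COMPOUND_COLLABORATION = "compound_collaboration"


# Assumed example of a text-based scenario record (layout is illustrative only).
example_scenario = {
    "domain": "household",
    "category": TaskCategory.TOOL_USE.value,
    "objects": [
        {"name": "pot", "mass_kg": 4.2, "temperature_c": 180, "location": "stove"},
        {"name": "oven_mitt", "heat_resistant": True, "location": "drawer"},
    ],
    "instruction": "Move the hot pot to the dining table.",
}
```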

Main Experimental Results

Main Experimental Results Table

Performance across task categories: Our comprehensive evaluation reveals severe performance degradation when models must reason from physical constraints rather than explicit instructions. While achieving high success rates (85-96%) on direct commands, performance drops significantly for tool reasoning (56-85%) and implicit collaboration (63-85%). Advanced reasoning models like o1-preview and DeepSeek-R1 show superior logical planning capabilities but still struggle with embodied constraint grounding.

Key findings: (1) Compound tasks show over 50% failure rates across all models, (2) Complete environmental information paradoxically degrades coordination performance, (3) Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations in current language models for embodied reasoning.

Key Insights

Our systematic evaluation reveals fundamental gaps in current language models' embodied reasoning abilities. The results demonstrate that embodied reasoning poses challenges fundamentally different from those current models can address, requiring architectural innovations beyond current training approaches.

🔧 Tool Reasoning Gap

Performance drops from 85-96% to 56-85% when models must infer tool needs from physical constraints rather than explicit instructions.

🤝 Collaboration Challenge

Implicit collaboration success falls to 63-85% compared to 88-92% with explicit coordination, revealing autonomous decision-making limitations.

🧩 Information Paradox

Complete environmental information degrades performance, indicating models cannot filter task-relevant constraints effectively.

Detailed Analysis

Parameter Scaling Analysis

Parameter Scaling Effects

Analysis of how model size affects embodied reasoning capabilities across different task complexities. Larger models show improved performance but still struggle with multi-agent coordination.

Step Efficiency Analysis


Relationship between reasoning steps and task success rates. Models require more steps for complex embodied reasoning but show diminishing returns beyond optimal step counts.

Additional Analysis

Environmental Information Impact

Impact of Environmental Information

Analysis of how different levels of environmental detail affect agent performance across task categories. Surprisingly, complete environmental information degrades coordination performance, indicating that current models cannot effectively filter task-relevant constraints from irrelevant information.
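As a loose illustration of the two observation conditions compared here, the snippet below contrasts a full scene dump with a task-filtered view. The keyword-matching heuristic and function names are our own assumptions for exposition, not the method used in the paper.

```python
def full_observation(scene: dict) -> str:
    """Serialize every object and attribute in the scene (complete information)."""
    return "\n".join(str(obj) for obj in scene["objects"])


def filtered_observation(scene: dict, task_keywords: set) -> str:
    """Keep only objects whose name or location matches the task (assumed heuristic)."""
    relevant = [obj for obj in scene["objects"]
                if obj["name"] in task_keywords or obj.get("location") in task_keywords]
    return "\n".join(str(obj) for obj in relevant)
```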

Efficiency Scatter Plot
Efficiency vs. Performance

Trade-off analysis between computational cost (token consumption) and task success rates across different model architectures.

Token Consumption Analysis
Token Consumption Patterns

Analysis of computational resource usage patterns across different task complexities and model architectures in embodied reasoning scenarios.

BibTeX

@misc{wang2025omniearbenchmarkingagentreasoning,
      title={OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks}, 
      author={Zixuan Wang and Dingming Li and Hongxing Li and Shuo Chen and Yuchen Yan and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
      year={2025},
      eprint={2508.05614},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.05614}, 
}