OmniEAR Logo

OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks

Zixuan Wang1*, Dingming Li1*, Hongxing Li1, Shuo Chen1, Yuchen Yan1, Wenqi Zhang1, Yongliang Shen1†, Weiming Lu1, Jun Xiao1, Yueting Zhuang1
1Zhejiang University
Preprint. Under review.
*Equal Contribution, †Corresponding Author
OmniEAR Framework Overview

OmniEAR is a comprehensive framework for evaluating agent reasoning in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Its text-based environment representation models continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains, revealing fundamental gaps in current language models' embodied reasoning abilities.

Abstract

Large language models excel at abstract reasoning, but their capacity for embodied agent reasoning remains largely unexplored. We present OmniEAR, a comprehensive framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. Unlike existing benchmarks that provide predefined tool sets or explicit collaboration directives, OmniEAR requires agents to dynamically acquire capabilities and autonomously determine coordination strategies based on task demands. Through text-based environment representation, we model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains.

Our systematic evaluation reveals severe performance degradation when models must reason from constraints: while achieving 85-96% success with explicit instructions, performance drops to 56-85% for tool reasoning and 63-85% for implicit collaboration, with compound tasks showing over 50% failure rates. Surprisingly, complete environmental information degrades coordination performance, indicating models cannot filter task-relevant constraints. Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations.

These findings demonstrate that embodied reasoning poses challenges fundamentally different from those current models can address, establishing OmniEAR as a rigorous benchmark for evaluating and advancing embodied AI systems.

📖 For detailed documentation, installation guides, and API reference, visit: omniembodied.readthedocs.io

Framework Overview

OmniEAR Data Generation Pipeline

OmniEAR employs a four-stage automated benchmark generation pipeline that combines large language models with rule-based validation to create diverse, physically consistent scenarios. The pipeline comprises (a) Scene Generation from an internet corpus using semantic seeds, (b) Task Generation with skill sampling across seven categories, (c) Evaluation Logic Extraction for automated assessment, and (d) Expert Trajectory Generation with human validation.
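As a rough illustration only, the four stages can be sketched in Python as below. The class and function names are hypothetical placeholders rather than the actual OmniEAR code, and the LLM-driven and human-validated steps are stubbed out.

```python
# Illustrative sketch of the four-stage generation pipeline; all names and
# signatures are assumptions, and the LLM / human-validation steps are stubs.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Scene:
    objects: list = field(default_factory=list)    # objects with continuous physical attributes
    relations: list = field(default_factory=list)  # spatial relations between objects


@dataclass
class Scenario:
    scene: Scene
    instruction: str                         # natural-language task description
    checker: Callable[[dict], bool]          # automated success predicate
    expert_trajectory: list = field(default_factory=list)


def generate_scene(seed: str) -> Scene:
    """(a) Scene Generation: expand a semantic seed into a scene (LLM step, stubbed)."""
    return Scene(objects=[{"name": seed}], relations=[])


def generate_task(scene: Scene) -> str:
    """(b) Task Generation: sample a skill from the seven categories (stubbed)."""
    return f"Interact with {scene.objects[0]['name']}"


def extract_checker(instruction: str) -> Callable[[dict], bool]:
    """(c) Evaluation Logic Extraction: derive a rule-based success test (stubbed)."""
    return lambda final_state: final_state.get("done", False)


def generate_trajectory(scene: Scene, instruction: str) -> list:
    """(d) Expert Trajectory Generation: reference solution, later human-validated (stubbed)."""
    return ["<expert actions>"]


def build_scenario(seed: str) -> Scenario:
    """Run one seed through all four stages."""
    scene = generate_scene(seed)
    instruction = generate_task(scene)
    return Scenario(scene, instruction, extract_checker(instruction),
                    generate_trajectory(scene, instruction))
```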

Our EAR-Bench contains 1,500 scenarios with 64K objects and 6K attribute types, spanning diverse domains from household to industrial settings. The balanced task distribution covers single-agent tasks (Direct Command, Tool Use, Attribute Reasoning, Compound Reasoning) and multi-agent collaboration tasks (Explicit, Implicit, and Compound Collaboration), enabling systematic evaluation of embodied reasoning capabilities across increasing cognitive complexity levels.
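For orientation, the seven task categories and a toy scenario record might look like the following sketch. The enum mirrors the category names listed above, while the record layout is only an assumed illustration of the text-based environment representation, not the released data format.

```python
from enum import Enum


class TaskCategory(Enum):
    # Single-agent tasks
    DIRECT_COMMAND = "direct_command"
    TOOL_USE = "tool_use"
    ATTRIBUTE_REASONING = "attribute_reasoning"
    COMPOUND_REASONING = "compound_reasoning"
    # Multi-agent collaboration tasks
    EXPLICIT_COLLABORATION = "explicit_collaboration"
    IMPLICIT_COLLABORATION = "implicit_collaboration"
    COMPOUND_COLLABORATION = "compound_collaboration"


# Assumed example of a text-based scenario record (layout is illustrative only).
example_scenario = {
    "domain": "household",
    "category": TaskCategory.TOOL_USE.value,
    "objects": [
        {"name": "pot", "mass_kg": 4.2, "temperature_c": 180, "location": "stove"},
        {"name": "oven_mitt", "heat_resistant": True, "location": "drawer"},
    ],
    "instruction": "Move the hot pot to the dining table.",
}
```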

Main Experimental Results

Main Experimental Results Table

Performance across task categories: Our comprehensive evaluation reveals severe performance degradation when models must reason from physical constraints rather than explicit instructions. While achieving high success rates (85-96%) on direct commands, performance drops significantly for tool reasoning (56-85%) and implicit collaboration (63-85%). Advanced reasoning models like o1-preview and DeepSeek-R1 show superior logical planning capabilities but still struggle with embodied constraint grounding.

Key findings: (1) Compound tasks show over 50% failure rates across all models, (2) Complete environmental information paradoxically degrades coordination performance, (3) Fine-tuning improves single-agent tasks dramatically (0.6% to 76.3%) but yields minimal multi-agent gains (1.5% to 5.5%), exposing fundamental architectural limitations in current language models for embodied reasoning.

Key Insights

Our systematic evaluation reveals fundamental gaps in current language models' embodied reasoning abilities. The results demonstrate that embodied reasoning poses challenges fundamentally different from those current models can address, requiring architectural innovations beyond current training approaches.

🔧 Tool Reasoning Gap

Performance drops from 85-96% to 56-85% when models must infer tool needs from physical constraints rather than explicit instructions.

🤝 Collaboration Challenge

Implicit collaboration success falls to 63-85% compared to 88-92% with explicit coordination, revealing autonomous decision-making limitations.

🧩 Information Paradox

Complete environmental information degrades performance, indicating models cannot filter task-relevant constraints effectively.

Detailed Analysis

Parameter Scaling Analysis

Parameter Scaling Effects

Analysis of how model size affects embodied reasoning capabilities across different task complexities. Larger models show improved performance but still struggle with multi-agent coordination.

Step Efficiency Analysis


Relationship between reasoning steps and task success rates. Models require more steps for complex embodied reasoning but show diminishing returns beyond optimal step counts.

Additional Analysis

Environmental Information Impact

Impact of Environmental Information

Analysis of how different levels of environmental detail affect agent performance across task categories. Surprisingly, complete environmental information degrades coordination performance, indicating that current models cannot effectively filter task-relevant constraints from irrelevant information.
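As a loose illustration of the two observation conditions compared here, the snippet below contrasts a full scene dump with a task-filtered view. The keyword-matching heuristic and function names are our own assumptions for exposition, not the method used in the paper.

```python
def full_observation(scene: dict) -> str:
    """Serialize every object and attribute in the scene (complete information)."""
    return "\n".join(str(obj) for obj in scene["objects"])


def filtered_observation(scene: dict, task_keywords: set) -> str:
    """Keep only objects whose name or location matches the task (assumed heuristic)."""
    relevant = [obj for obj in scene["objects"]
                if obj["name"] in task_keywords or obj.get("location") in task_keywords]
    return "\n".join(str(obj) for obj in relevant)
```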

Efficiency Scatter Plot
Efficiency vs. Performance

Trade-off analysis between computational cost (token consumption) and task success rates across different model architectures.

Token Consumption Analysis
Token Consumption Patterns

Analysis of computational resource usage patterns across different task complexities and model architectures in embodied reasoning scenarios.

BibTeX

@misc{wang2025omniearbenchmarkingagentreasoning,
      title={OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks}, 
      author={Zixuan Wang and Dingming Li and Hongxing Li and Shuo Chen and Yuchen Yan and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
      year={2025},
      eprint={2508.05614},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.05614}, 
}