Framework Overview

OmniEAR employs a four-stage automated benchmark-generation pipeline that combines large language models with rule-based validation to create diverse, physically consistent scenarios. The framework comprises: (a) Scene Generation from an internet corpus using semantic seeds, (b) Task Generation with skill sampling across seven categories, (c) Evaluation Logic Extraction for automated assessment, and (d) Expert Trajectory Generation with human validation.
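
To make the stage ordering concrete, the sketch below chains the four stages into a single instance-construction function. It is a minimal illustration only: the stage names follow the overview above, but every identifier (generate_scene, build_instance, the Scene and BenchmarkInstance containers) is hypothetical and not part of the released framework, and the placeholder bodies stand in for LLM calls plus rule-based validation.

```python
from dataclasses import dataclass, field

# Illustrative containers only; the framework's actual data schema is not
# specified in this overview.
@dataclass
class Scene:
    objects: list = field(default_factory=list)
    attributes: dict = field(default_factory=dict)

@dataclass
class BenchmarkInstance:
    scene: Scene
    task: str
    evaluation_logic: dict
    expert_trajectory: list

def generate_scene(semantic_seed: str) -> Scene:
    # Stage (a): an LLM expands a semantic seed drawn from an internet corpus
    # into a candidate scene; rule-based validators check physical consistency.
    return Scene(objects=[{"name": semantic_seed}])

def generate_task(scene: Scene, skill_category: str) -> str:
    # Stage (b): sample a task from one of the seven skill categories,
    # grounded in the generated scene.
    return f"{skill_category} task grounded in {len(scene.objects)} object(s)"

def extract_evaluation_logic(scene: Scene, task: str) -> dict:
    # Stage (c): derive machine-checkable success conditions for automated
    # assessment of agent behavior.
    return {"task": task, "goal_predicates": []}

def generate_expert_trajectory(scene: Scene, task: str) -> list:
    # Stage (d): produce a reference action sequence, which is then
    # validated by human annotators.
    return ["<expert actions>"]

def build_instance(seed: str, category: str) -> BenchmarkInstance:
    scene = generate_scene(seed)
    task = generate_task(scene, category)
    return BenchmarkInstance(
        scene=scene,
        task=task,
        evaluation_logic=extract_evaluation_logic(scene, task),
        expert_trajectory=generate_expert_trajectory(scene, task),
    )
```
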
Our EAR-Bench contains 1,500 scenarios with 64K objects and 6K attribute types, spanning domains from household to industrial settings. The balanced task distribution covers single-agent tasks (Direct Command, Tool Use, Attribute Reasoning, Compound Reasoning) and multi-agent collaboration tasks (Explicit, Implicit, and Compound Collaboration), enabling systematic evaluation of embodied reasoning across increasing levels of cognitive complexity.
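
The seven-category taxonomy could be encoded as follows. The grouping into single-agent and multi-agent tasks mirrors the distribution described above; the identifier names themselves are our own and are not taken from the benchmark's codebase.

```python
from enum import Enum

class TaskCategory(Enum):
    # Single-agent tasks
    DIRECT_COMMAND = "direct_command"
    TOOL_USE = "tool_use"
    ATTRIBUTE_REASONING = "attribute_reasoning"
    COMPOUND_REASONING = "compound_reasoning"
    # Multi-agent collaboration tasks
    EXPLICIT_COLLABORATION = "explicit_collaboration"
    IMPLICIT_COLLABORATION = "implicit_collaboration"
    COMPOUND_COLLABORATION = "compound_collaboration"

SINGLE_AGENT = {
    TaskCategory.DIRECT_COMMAND,
    TaskCategory.TOOL_USE,
    TaskCategory.ATTRIBUTE_REASONING,
    TaskCategory.COMPOUND_REASONING,
}
MULTI_AGENT = set(TaskCategory) - SINGLE_AGENT
```
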