Abstract: In this study, we explore the sophisticated domain of task planning for robust household embodied agents, with a particular emphasis on the intricate task of selecting substitute objects. We introduce the CommonSense Object Affordance Task (COAT), a novel framework designed to analyze reasoning capabilities in commonsense scenarios. This approach is centered on understanding how these agents can effectively identify and utilize alternative objects when executing household tasks, thereby offering insights into the complexities of practical decision-making in real-world environments. Drawing inspiration from factors affecting human decision-making, we explore how large language models tackle this challenge through four meticulously crafted commonsense question-and-answer datasets featuring refined rules and human annotations. Our evaluation of state-of-the-art language models on these datasets sheds light on three pivotal considerations: 1) aligning an object's inherent utility with the task at hand, 2) navigating contextual dependencies (societal norms, safety, appropriateness, and efficiency), and 3) accounting for the current physical state of the object. To maintain accessibility, we introduce five abstract variables reflecting an object's physical condition, modulated by human insights, to simulate diverse household scenarios. Our contributions include insightful human preference mappings for all three factors and four extensive QA datasets (2K, 15K, 60K, 70K questions) probing the intricacies of utility dependencies, contextual dependencies, and object physical states. The datasets, along with our findings, are accessible at: https://github.com/Ayush8120/COAT. This research not only advances our understanding of physical commonsense reasoning in language models but also paves the way for future improvements in household agent intelligence.

PROST: Physical Reasoning of Objects through Space and Time

PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning

A Benchmark for Modeling Violation-of-Expectation in Physical Reasoning Across Event Categories

Physical Reasoning and Object Planning for Household Embodied Agents

Probing Physical Reasoning with Counter-Commonsense Context

Compositional Physical Reasoning of Objects and Events from Videos

SAT: Spatial Aptitude Training for Multimodal Language Models

STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning

Benchmarks for Physical Reasoning AI

Space3D-Bench: Spatial 3D Question Answering Benchmark

ContPhy: Continuum Physical Concept Learning and Reasoning from Videos

Octopi: Object Property Reasoning with Large Tactile-Language Models

VIPHY: Probing "Visible" Physical Commonsense Knowledge

Probing Physics Knowledge Using Tools from Developmental Psychology

3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models

AVoE: A Synthetic 3D Dataset on Understanding Violation of Expectation for Artificial Cognition

CLEVR-POC: Reasoning-Intensive Visual Question Answering in Partially Observable Environments

BRAINTEASER: Lateral Thinking Puzzles for Large Language Models

ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models