Abstract: We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work, which tackles simple queries that directly reference target objects and properties ("What is the color of the car?"), situational queries (such as "Is the house ready for sleeptime?") are more challenging: they require the agent to identify multiple objects (Doors: Closed, Lights: Off, etc.) and reach a consensus on their states to produce an answer. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to create a dataset of unique situational queries and corresponding consensus object information. PGE maintains uniqueness among the generated queries by checking semantic similarity in a feedback loop. We annotate the generated data for ground-truth answers via a large-scale user study conducted on M-Turk and, with a high answerability rate of 97.26%, establish that LLMs are good at generating situational data. However, using the same LLM to answer the queries gives a low success rate of 46.2%, indicating that while LLMs are good at generating query data, they are poor at answering it. We use images from the VirtualHome simulator with the S-EQA queries to establish an evaluation benchmark via Visual Question Answering (VQA). We report an accuracy improvement of 15.31% when using queries framed from the generated object consensus for VQA, compared with directly answering the situational ones, indicating that such simplification is necessary for improved performance. To the best of our knowledge, this is the first work to introduce EQA in the context of situational queries that also uses a generative approach for query creation. Through this work, we aim to foster research on improving the real-world usability of embodied agents in household environments.
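
The uniqueness feedback loop described for PGE can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`pge_loop`, `token_overlap`, `stub_generate`), the 0.5 threshold, and the canned generator standing in for an LLM are all assumptions made for illustration, and a simple token-overlap (Jaccard) score is swapped in for the semantic similarity measure the abstract mentions.

```python
import re
from typing import Callable, List, Optional


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a crude stand-in for semantic similarity."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def pge_loop(
    generate: Callable[[List[str], List[str]], Optional[str]],
    n_queries: int,
    threshold: float = 0.5,   # hypothetical cutoff, not taken from the paper
    max_rounds: int = 20,
    similarity: Callable[[str, str], float] = token_overlap,
) -> List[str]:
    """Prompt-generate-evaluate sketch: keep a candidate only if it is
    sufficiently different from every query accepted so far."""
    accepted: List[str] = []
    rejected: List[str] = []
    for _ in range(max_rounds):
        if len(accepted) >= n_queries:
            break
        # The generator sees accepted and rejected queries as feedback.
        candidate = generate(accepted, rejected)
        if candidate is None:
            break
        if any(similarity(candidate, q) >= threshold for q in accepted):
            rejected.append(candidate)   # too close to an existing query
        else:
            accepted.append(candidate)
    return accepted


# Hypothetical stand-in for an LLM call: returns canned candidates in order.
_canned = iter([
    "Is the house ready for sleeptime?",
    "Is the house ready for sleeptime now?",
    "Is the kitchen ready for cooking dinner?",
])


def stub_generate(accepted: List[str], rejected: List[str]) -> Optional[str]:
    return next(_canned, None)


queries = pge_loop(stub_generate, n_queries=2)
print(queries)
```

In this sketch the second candidate is rejected as a near-duplicate of the first, while the third is accepted. A real pipeline would presumably replace `stub_generate` with an LLM prompt that includes the feedback, and `token_overlap` with an embedding-based similarity.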

"Is This It?": Towards Ecologically Valid Benchmarks for Situated Collaboration

Exploring and Analyzing Machine Commonsense Benchmarks

SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge

Towards Collaborative Question Answering: A Preliminary Study

SocialBench: Sociality Evaluation of Role-Playing Conversational Agents

Benchmarking Foundation Models with Language-Model-as-an-Examiner

From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline

Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models

Annotator in the Loop: A Case Study of In-Depth Rater Engagement to Create a Bridging Benchmark Dataset

LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning

ECBD: Evidence-Centered Benchmark Design for NLP

SocialIQA: Commonsense Reasoning about Social Interactions

BENCHAGENTS: Automated Benchmark Creation with Agent Interaction

S-EQA: Tackling Situational Queries in Embodied Question Answering

CulturalTeaming: AI-Assisted Interactive Red-Teaming for Challenging LLMs' (Lack of) Multicultural Knowledge

BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

Emergent Communication in Interactive Sketch Question Answering

BEHAVIOR in Habitat 2.0: Simulator-Independent Logical Task Description for Benchmarking Embodied AI Agents

EgoTaskQA: Understanding Human Tasks in Egocentric Videos

Space3D-Bench: Spatial 3D Question Answering Benchmark