Abstract: We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work, which tackles simple queries that directly reference target objects and properties ("What is the color of the car?"), situational queries (such as "Is the house ready for sleeptime?") are more challenging: they require the agent to identify multiple objects (Doors: Closed, Lights: Off, etc.) and reach a consensus on their states to produce an answer. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to create a dataset of unique situational queries and corresponding consensus object information. PGE maintains uniqueness among the generated queries by checking semantic similarity in a feedback loop. We annotate the generated data for ground-truth answers via a large-scale user study conducted on M-Turk and, with a high answerability rate of 97.26%, establish that LLMs are good at generating situational data. However, using the same LLM to answer the queries gives a low success rate of 46.2%, indicating that while LLMs are good at generating query data, they are poor at answering the queries. We use images from the VirtualHome simulator with the S-EQA queries to establish an evaluation benchmark via Visual Question Answering (VQA). We report an accuracy improvement of 15.31% when using queries framed from the generated object consensus for VQA over directly answering situational ones, indicating that such simplification is necessary for improved performance. To the best of our knowledge, this is the first work to introduce EQA in the context of situational queries that also uses a generative approach for query creation. Through this work, we aim to foster research on improving the real-world usability of embodied agents in household environments.
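
The uniqueness feedback loop described for PGE can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`pge_loop`, `make_stub_generator`), the threshold value, and the canned queries are hypothetical, the LLM call is replaced by a stub, and token-overlap (Jaccard) similarity stands in for whatever semantic similarity measure the paper actually uses.

```python
import itertools
import re
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]. ASSUMPTION: a simple stand-in
    for a semantic (e.g. embedding-based) similarity score."""
    ta = set(re.findall(r"[a-z0-9]+", a.lower()))
    tb = set(re.findall(r"[a-z0-9]+", b.lower()))
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def pge_loop(
    generate: Callable[[List[str]], str],
    n_queries: int,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
    max_attempts: int = 100,
) -> List[str]:
    """Collect up to n_queries mutually dissimilar queries.

    `generate` receives the queries accepted so far (the feedback) and
    returns one new candidate query.
    """
    accepted: List[str] = []
    for _ in range(max_attempts):
        if len(accepted) >= n_queries:
            break
        candidate = generate(accepted)
        # Evaluate: keep the candidate only if it is not too close
        # to any query that has already been accepted.
        if all(jaccard(candidate, q) < threshold for q in accepted):
            accepted.append(candidate)
    return accepted


def make_stub_generator(outputs: List[str]) -> Callable[[List[str]], str]:
    """Hypothetical stand-in for an LLM call: cycles through canned outputs
    and ignores the feedback it is given."""
    source = itertools.cycle(outputs)

    def generate(accepted: List[str]) -> str:
        return next(source)

    return generate


canned = [
    "Is the house ready for sleeptime?",
    "Is the house ready for sleeptime now?",  # near-duplicate of the first
    "Is the kitchen ready for cooking dinner?",
]
queries = pge_loop(make_stub_generator(canned), n_queries=2)
```

In a real pipeline the stub would be replaced by a prompt that includes the accepted queries, so that the model is steered away from repeating them; the filter then acts as a safeguard when that steering fails.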
