Abstract: In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. Through experiments in simulation on the HM-EQA dataset and in the real world in home and office environments, we demonstrate that our method outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: how can a robot answer natural-language questions by exploring and understanding an environment it has never seen before? Specifically, the paper targets several key challenges in the **Embodied Question Answering (EQA)** task:

1. **Obtaining useful semantic representations**: The robot needs to recognize and understand the objects in the environment and their relationships.
2. **Updating these semantic representations in real time**: The robot needs to continuously update its understanding of the environment as it explores.
3. **Using prior world knowledge for efficient exploration and planning**: The robot needs to combine common sense with current environmental information to develop a reasonable exploration strategy.

To solve these problems, the authors propose **GraphEQA**, a new method that uses real-time 3D semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding vision-language models (VLMs) to complete the EQA task. Specifically, the main contributions of GraphEQA include:

- **Constructing real-time, compact multi-modal semantic memory**: Combining global, semantically sparse, task-independent information with local, semantically rich, task-relevant images.
- **Enriching 3DSGs**: Adding semantically enriched frontier nodes and room labels.
- **Hierarchical planning method**: Using the hierarchical structure of 3DSGs for structured exploration and planning.
- **Experimental verification**: Extensive experiments on the HM-EQA dataset and in real-world home and office environments demonstrate that GraphEQA achieves a higher success rate with fewer planning steps.

### Specific problem analysis

1. **How to construct and enrich 3D scene graphs?**
   - Use the Hydra framework to construct a hierarchical 3D metric-semantic scene graph.
   - Assign semantic labels to room nodes using an LLM.
   - Enrich frontier nodes by clustering them and connecting them to the nearest object nodes.
2. **How to select task-relevant visual memories?**
   - At each planning step, score candidate images for relevance to the task with the SigLIP model and retain only the most relevant ones.
3. **How to perform hierarchical planning?**
   - The planner selects the next action based on the current state, history, scene graph, and visual memory.
   - Action types include going to object nodes or frontier nodes, to ensure exploration of informative areas.
4. **What are the termination conditions?**
   - Terminate when the planner is sufficiently confident in the answer (confidence exceeds 0.9).
   - Otherwise, terminate when the maximum allowed number of planning steps is reached.

Through these methods, GraphEQA can efficiently complete the EQA task in an unknown environment, reducing the required planning steps and increasing the task success rate.
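
The memory-selection and termination rules described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: `PlannerOutput`, `keep_top_k_images`, `run_episode`, and the action strings are hypothetical names, the VLM planner and the SigLIP relevance model are replaced by plain callables (`plan_step`, `score_fn`) supplied by the caller, and the defaults for `max_steps` and `k` are assumed values. Only the 0.9 confidence threshold comes from the summary.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class PlannerOutput:
    """Hypothetical result of one planning step."""
    answer: str
    confidence: float   # planner's self-reported confidence in `answer`
    next_action: str    # e.g. "goto_object:<id>" or "goto_frontier:<id>"


def keep_top_k_images(memory: List[Tuple[float, str]],
                      new_images: List[str],
                      score_fn: Callable[[str], float],
                      k: int) -> List[Tuple[float, str]]:
    """Merge new images into memory and keep the k most task-relevant ones.

    `score_fn` stands in for an image-text relevance model such as SigLIP.
    """
    scored = memory + [(score_fn(img), img) for img in new_images]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def run_episode(plan_step: Callable[[List[str], List[str]], PlannerOutput],
                observe: Callable[[Optional[str]], List[str]],
                score_fn: Callable[[str], float],
                max_steps: int = 5,          # assumed step budget
                conf_threshold: float = 0.9,
                k: int = 2                   # assumed memory size
                ) -> Tuple[str, int, bool]:
    """Plan until the planner is confident or the step budget runs out.

    Returns (answer, steps_used, terminated_confidently).
    """
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")
    memory: List[Tuple[float, str]] = []
    history: List[str] = []
    last_action: Optional[str] = None
    for step in range(1, max_steps + 1):
        # Observe after the previous action, then refresh the visual memory.
        memory = keep_top_k_images(memory, observe(last_action), score_fn, k)
        output = plan_step(history, [img for _, img in memory])
        if output.confidence > conf_threshold:
            return output.answer, step, True
        last_action = output.next_action
        history.append(last_action)
    # Step budget exhausted: return the planner's last answer.
    return output.answer, max_steps, False
```

In this sketch, the scene graph is omitted from the planner's inputs for brevity; a fuller version would pass it to `plan_step` alongside the history and the retained images.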

GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

Explore until Confident: Efficient Exploration for Embodied Question Answering

Knowledge-Based Embodied Question Answering

EfficientEQA: An Efficient Approach for Open Vocabulary Embodied Question Answering

Depth and Video Segmentation Based Visual Attention for Embodied Question Answering

S-EQA: Tackling Situational Queries in Embodied Question Answering

Graphhopper: Multi-hop Scene Graph Reasoning for Visual Question Answering

Understanding the Role of Scene Graphs in Visual Question Answering

SQA3D: Situated Question Answering in 3D Scenes

Embodied Question Answering

Multi-agent Embodied Question Answering in Interactive Environments

Situational Awareness Matters in 3D Vision Language Reasoning

An Empirical Study on Leveraging Scene Graphs for Visual Question Answering

3D Question Answering for City Scene Understanding

SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering

Map-based Modular Approach for Zero-shot Embodied Question Answering

Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering

Learning Situation Hyper-Graphs for Video Question Answering

CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes