GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

Saumya Saxena, Blake Buchanan, Chris Paxton, Bingqing Chen, Narunas Vaskevicius, Luigi Palmieri, Jonathan Francis, Oliver Kroemer
2024-12-19
Abstract: In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. Through experiments in simulation on the HM-EQA dataset and in the real world in home and office environments, we demonstrate that our method outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps.
Robotics, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper asks: how can a robot answer a natural-language question about an environment it has never seen before by exploring and understanding that environment? Specifically, it targets several key challenges in the **Embodied Question Answering (EQA)** task:

1. **Obtaining useful semantic representations**: The robot needs to recognize and understand the objects in the environment and the relationships between them.
2. **Updating these semantic representations in real time**: The robot needs to continuously update its understanding of the environment as it explores.
3. **Using prior world knowledge for efficient exploration and planning**: The robot needs to combine common sense with current information about the environment to choose a reasonable exploration strategy.

To address these challenges, the authors propose **GraphEQA**, a method that uses real-time 3D semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding vision-language models (VLMs) in the EQA task. The main contributions of GraphEQA are:

- **Constructing a real-time, compact multi-modal semantic memory**: Combining global, semantically sparse, task-independent information with local, semantically rich, task-relevant images.
- **Enriching 3DSGs**: Adding semantically enriched frontier nodes and room labels.
- **Hierarchical planning method**: Using the hierarchical structure of 3DSGs for structured exploration and planning.
- **Experimental verification**: Experiments on the HM-EQA dataset and in real-world home and office environments show that GraphEQA achieves a higher success rate with fewer planning steps.

### Specific problem analysis

1. **How to construct and enrich 3D scene graphs?**
   - Use the Hydra framework to construct a hierarchical 3D metric-semantic scene graph.
   - Assign semantic labels to room nodes using an LLM.
   - Enrich frontier nodes by clustering them and connecting them to the nearest object nodes.
2. **How to select task-relevant visual memories?**
   - At each planning step, use the SigLIP model to score how relevant each image is to the task, and store only the most relevant images.
3. **How to perform hierarchical planning?**
   - The planner selects the next action based on the current state, history, scene graph, and visual memory.
   - Action types include going to object nodes or frontier nodes, so that the robot explores informative areas.
4. **What are the termination conditions?**
   - Terminate when the planner is sufficiently confident in its answer (confidence exceeds 0.9).
   - Otherwise, terminate when the maximum allowed number of planning steps is reached.

With these components, GraphEQA can efficiently complete the EQA task in an unknown environment, reducing the number of planning steps required and increasing the task success rate.
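
The image-memory selection and the two termination conditions described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `PlannerOutput`, `keep_most_relevant`, `run_episode`, and the `plan_step` callback are hypothetical names; relevance scores are assumed to be precomputed (for example by SigLIP); and returning the last answer when the step budget runs out is an assumption, since the summary only says the episode terminates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.9  # threshold stated in the summary above


@dataclass
class PlannerOutput:
    """One planner decision (hypothetical structure for illustration)."""
    answer: str
    confidence: float          # planner's self-reported confidence in [0, 1]
    next_node: Optional[str]   # id of the object or frontier node to visit next


def keep_most_relevant(scored_images: List[Tuple[str, float]], k: int) -> List[str]:
    """Return the ids of the k images with the highest relevance scores.

    Scores are assumed to come from an image-text relevance model.
    """
    ranked = sorted(scored_images, key=lambda pair: pair[1], reverse=True)
    return [image_id for image_id, _ in ranked[:k]]


def run_episode(
    plan_step: Callable[[int, List[str]], PlannerOutput],
    max_steps: int,
) -> Tuple[str, int, bool]:
    """Run planning steps until the planner is confident or the budget is spent.

    Returns (answer, steps_used, confident).
    """
    if max_steps < 1:
        raise ValueError("max_steps must be at least 1")

    visited: List[str] = []  # nodes the agent has been sent to so far
    output: Optional[PlannerOutput] = None
    for step in range(1, max_steps + 1):
        output = plan_step(step, visited)
        # Condition 1: the planner's confidence exceeds the threshold.
        if output.confidence > CONFIDENCE_THRESHOLD:
            return output.answer, step, True
        if output.next_node is not None:
            visited.append(output.next_node)

    # Condition 2: step budget exhausted. Returning the last answer here
    # is an assumption made for this sketch.
    return output.answer, max_steps, False
```

For example, with a stub planner that reports confidence 0.5 on the first step and 0.95 on the second, `run_episode` stops after two steps and marks the answer as confident; with a planner that never exceeds 0.9, it stops once `max_steps` is reached and marks the answer as not confident.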