Abstract: 3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. While existing research has primarily focused on indoor household tasks and outdoor roadside autonomous driving tasks, there has been limited exploration of city-level scene understanding tasks. Furthermore, existing research faces challenges in understanding city scenes, due to the absence of spatial semantic information and human-environment interaction information at the city level. To address these challenges, we investigate 3D MQA from both dataset and method perspectives. From the dataset perspective, we introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding, which is the first dataset to incorporate scene semantic and human-environment interactive tasks within the city. From the method perspective, we propose a Scene graph enhanced City-level Understanding method (Sg-CityU), which utilizes the scene graph to introduce the spatial semantic. A new benchmark is reported and our proposed Sg-CityU achieves accuracy of 63.94% and 63.76% in different settings of City-3DQA. Compared to indoor 3D MQA methods and zero-shot using advanced large language models (LLMs), Sg-CityU demonstrates state-of-the-art (SOTA) performance in robustness and generalization.

What problem does this paper attempt to address?

The paper addresses 3D Multimodal Question Answering (3D MQA) for urban scene understanding. Specifically, it aims to overcome two challenges that existing 3D MQA work faces at city scale:

1. **Lack of spatial semantic information**: Most existing 3D MQA research concentrates on indoor home environments or outdoor autonomous driving scenarios. At the city scale, however, understanding and representing the spatial relationships between different instances remains an unresolved issue.
2. **Lack of city-level interaction information**: City-level understanding and interaction require considering more types of instances and their interactions, such as buildings and vegetation, rather than just vehicles and pedestrians.

To address these challenges, the paper works from two perspectives, dataset and method:

- **Dataset**: A new large-scale 3D MQA dataset, City-3DQA, is proposed. This is the first 3D MQA dataset aimed at outdoor urban scene understanding. It includes rich data on urban instance segmentation, scene semantic extraction, and question-answer pair construction.
- **Method**: A scene graph enhanced city-level understanding method (Sg-CityU) is proposed. This method captures the spatial relationships between instances by introducing a scene graph, thereby generating high-quality city-related answers.

The main contributions of the paper are:

1. Applying 3D MQA to city-scale scene understanding for the first time to support human or agent activities in cities.
2. Proposing the City-3DQA dataset, the first 3D MQA dataset that considers urban scene semantic information and city-level interaction tasks.
3. Designing a benchmark method, Sg-CityU, which introduces spatial relationship information through a scene graph to generate high-quality city-related answers.
4. Establishing a new benchmark to evaluate the performance of existing 3D MQA methods and zero-shot methods based on large language models (LLMs) on the City-3DQA dataset.

Experimental results show that the proposed Sg-CityU method achieves the best performance in terms of robustness and generalization ability. In summary, the paper offers new solutions to key issues in city-scale scene understanding and provides a basis for future research.
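
To make the idea of a scene graph over city instances more concrete, here is a minimal Python sketch. It is not the Sg-CityU implementation: the `Instance` fields, the relation names (`near`, `north_of`, and so on), the 50 m threshold, and the example scene are all assumptions chosen for illustration. It only shows how spatial relations between instances could be stored as triples and then queried.

```python
from dataclasses import dataclass
from itertools import permutations
import math


@dataclass(frozen=True)
class Instance:
    """A city instance with a planar position (hypothetical schema)."""
    name: str
    category: str
    x: float  # metres east of an arbitrary origin
    y: float  # metres north of an arbitrary origin


def build_scene_graph(instances, near_threshold=50.0):
    """Return (subject, relation, object) triples for every ordered pair.

    Assumes instances have distinct positions. The relation vocabulary
    and the threshold are illustrative, not taken from the paper.
    """
    edges = []
    for a, b in permutations(instances, 2):
        dx, dy = a.x - b.x, a.y - b.y
        if math.hypot(dx, dy) <= near_threshold:
            edges.append((a.name, "near", b.name))
        # Pick the direction along the dominant axis.
        if abs(dx) >= abs(dy):
            relation = "east_of" if dx > 0 else "west_of"
        else:
            relation = "north_of" if dy > 0 else "south_of"
        edges.append((a.name, relation, b.name))
    return edges


def query(edges, relation, obj):
    """Names of all subjects that stand in `relation` to `obj`."""
    return sorted(s for s, r, o in edges if r == relation and o == obj)


# Hypothetical three-instance scene.
scene = [
    Instance("hospital", "building", 0.0, 0.0),
    Instance("park", "vegetation", 30.0, 0.0),
    Instance("tower", "building", 0.0, 200.0),
]
edges = build_scene_graph(scene)
print(query(edges, "near", "hospital"))  # ['park']
```

A question such as "What is near the hospital?" then reduces to a lookup over the triples. A learned method would instead encode the graph and fuse it with the question and point-cloud features, which this sketch does not attempt.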

3D Question Answering for City Scene Understanding

SQA3D: Situated Question Answering in 3D Scenes

Multi-modal Situated Reasoning in 3D Scenes

Space3D-Bench: Spatial 3D Question Answering Benchmark

NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario

Comprehensive Visual Question Answering on Point Clouds through Compositional Scene Manipulation

GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes

Situational Awareness Matters in 3D Vision Language Reasoning

Multimodal 3D Reasoning Segmentation with Complex Scenes

Multi-agent Embodied Question Answering in Interactive Environments

3D-Aware Visual Question Answering about Parts, Poses and Occlusions

Generating Context-Aware Natural Answers for Questions in 3D Scenes

Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents

3D visual question answering based on sub-questions asymptotic reasoning

Depth and Video Segmentation Based Visual Attention for Embodied Question Answering

Multimodal Question Answering for Unified Information Extraction

UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

Unifying 3D Vision-Language Understanding via Promptable Queries

Question-Aware Global-Local Video Understanding Network for Audio-Visual Question Answering

Recent Advances in Multi-modal 3D Scene Understanding: A Comprehensive Survey and Evaluation