3D Question Answering for City Scene Understanding

Penglei Sun, Yaoxian Song, Xiang Liu, Xiaofei Yang, Qiang Wang, Tiefeng Li, Yang Yang, Xiaowen Chu
2024-07-25
Abstract: 3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. While existing research has primarily focused on indoor household tasks and outdoor roadside autonomous driving tasks, there has been limited exploration of city-level scene understanding tasks. Furthermore, existing research faces challenges in understanding city scenes, due to the absence of spatial semantic information and human-environment interaction information at the city level. To address these challenges, we investigate 3D MQA from both dataset and method perspectives. From the dataset perspective, we introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding, which is the first dataset to incorporate scene semantic and human-environment interactive tasks within the city. From the method perspective, we propose a Scene graph enhanced City-level Understanding method (Sg-CityU), which utilizes the scene graph to introduce the spatial semantic. A new benchmark is reported and our proposed Sg-CityU achieves accuracy of 63.94% and 63.76% in different settings of City-3DQA. Compared to indoor 3D MQA methods and zero-shot using advanced large language models (LLMs), Sg-CityU demonstrates state-of-the-art (SOTA) performance in robustness and generalization.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses 3D Multimodal Question Answering (3D MQA) for urban scene understanding. Specifically, it aims to overcome two main challenges that existing 3D MQA work faces at city scale:

1. **Lack of spatial semantic information**: Most existing 3D MQA research concentrates on indoor home environments or outdoor autonomous driving scenarios. At the city scale, understanding and representing the spatial relationships between different instances remains an unresolved issue.
2. **Lack of city-level interaction information**: City-level understanding and interaction must cover more types of instances and their interactions, such as buildings and vegetation, rather than just vehicles and pedestrians.

To address these challenges, the research works from two perspectives, dataset and method:

- **Dataset**: A new large-scale 3D MQA dataset, City-3DQA, is proposed. This is the first 3D MQA dataset aimed at outdoor urban scene understanding. It includes rich data on urban instance segmentation, scene semantic extraction, and question-answer pair construction.
- **Method**: A scene graph enhanced city-level understanding method (Sg-CityU) is proposed. This method captures the spatial relationships between instances by introducing a scene graph, thereby generating high-quality city-related answers.

The main contributions of the paper include:

1. Applying 3D MQA to city-scale scene understanding for the first time to support human or agent activities in cities.
2. Proposing the City-3DQA dataset, the first 3D MQA dataset that considers urban scene semantic information and city-level interaction tasks.
3. Designing a benchmark method, Sg-CityU, which introduces spatial relationship information through a scene graph to generate high-quality city-related answers.
4. Establishing a new benchmark to evaluate the performance of existing 3D MQA methods and zero-shot methods based on large language models (LLMs) on the City-3DQA dataset.

Experimental results show that the proposed Sg-CityU method achieves the best performance in terms of robustness and generalization ability. In summary, the paper contributes a dataset, a method, and a benchmark for city-scale scene understanding that future research can build on.
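
To make the scene graph idea concrete, the toy sketch below builds a small graph of spatial relations between city instances and answers a simple lookup question from it. It is not the Sg-CityU implementation: the `Instance` class, the relation names (`near`, `north of`, and so on), the 50-unit distance threshold, and the `query` helper are all assumptions made for illustration, and the paper's actual graph construction and answer generation are not described in this summary.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Instance:
    """A hypothetical city instance reduced to a ground-plane centroid."""
    name: str
    category: str
    x: float  # east-west coordinate (larger = further east)
    y: float  # north-south coordinate (larger = further north)


def spatial_relations(a, b, near_threshold=50.0):
    """Return coarse relations of `a` relative to `b` (assumed vocabulary)."""
    dx, dy = a.x - b.x, a.y - b.y
    relations = []
    if (dx ** 2 + dy ** 2) ** 0.5 <= near_threshold:
        relations.append("near")
    if dx == 0 and dy == 0:
        return relations  # coincident centroids have no direction
    # Pick the dominant axis so each ordered pair gets one direction.
    if abs(dx) >= abs(dy):
        relations.append("east of" if dx > 0 else "west of")
    else:
        relations.append("north of" if dy > 0 else "south of")
    return relations


def build_scene_graph(instances, near_threshold=50.0):
    """Return (subject, relation, object) triples for every ordered pair."""
    return [
        (a.name, rel, b.name)
        for a, b in permutations(instances, 2)
        for rel in spatial_relations(a, b, near_threshold)
    ]


def query(edges, relation, target):
    """Names of instances that stand in `relation` to `target`."""
    return sorted(s for s, r, o in edges if r == relation and o == target)


instances = [
    Instance("hospital", "building", 0.0, 0.0),
    Instance("park", "vegetation", 30.0, 0.0),
    Instance("tower", "building", 0.0, 200.0),
]
edges = build_scene_graph(instances)

# "What is near the hospital?" answered by a graph lookup.
print(query(edges, "near", "hospital"))
```

In a learned model the triples would be encoded and fused with point-cloud and question features instead of being queried directly; the lookup here only shows what kind of spatial information a scene graph makes explicit.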