LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers

Yeong-Seung Baek, Heung-Seon Oh
2024-11-07
Abstract: 3D visual grounding (VG) aims to locate relevant objects or regions within 3D scenes based on natural language descriptions. Although recent methods for indoor 3D VG have successfully used transformer-based architectures to capture global contextual information and enable fine-grained cross-modal fusion, they are unsuitable for outdoor environments because the distribution of point clouds differs between indoor and outdoor settings. First, extensive LiDAR point clouds demand unacceptable computational and memory resources within transformers due to the high-dimensional visual features. Second, dominant background points and empty spaces in sparse LiDAR point clouds complicate cross-modal fusion owing to their irrelevant visual information. To address these challenges, we propose LidaRefer, a transformer-based 3D VG framework designed for large-scale outdoor scenes. Moreover, during training, we introduce a simple and effective localization method that supervises the decoder's queries to localize not only the target object but also ambiguous objects, which might be confused with the target because they exhibit similar attributes in the scene or because the language description is misunderstood. This supervision enhances the model's ability to distinguish ambiguous objects from the target by learning the differences in their spatial relationships and attributes. LidaRefer achieves state-of-the-art performance on Talk2Car-3D, a 3D VG dataset for autonomous driving, with significant improvements under various evaluation settings.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses outdoor 3D visual grounding (3D VG): accurately locating target objects in the LiDAR point clouds of large-scale outdoor scenes from natural language descriptions. Existing indoor 3D VG methods have successfully used Transformer-based architectures to capture global context and achieve fine-grained cross-modal fusion, but these methods are not suitable for outdoor environments. Specifically, LiDAR point clouds in outdoor scenes present the following challenges:

1. **High Dimensionality and Sparsity**: Large LiDAR point clouds require unacceptable computational and memory resources, especially when dealing with high-dimensional visual features.
2. **Background Noise and Voids**: The dominant background points and empty regions in sparse point clouds contain irrelevant visual information, which makes cross-modal fusion more complex.

To solve these problems, the authors propose LidaRefer, a Transformer-based 3D VG framework designed specifically for large-scale outdoor scenes. In addition, the authors introduce a simple and effective training-time supervision method, namely **Ambiguous Object Localization**, to enhance the model's ability to distinguish ambiguous objects from the target object. LidaRefer achieves state-of-the-art performance on the Talk2Car-3D dataset under various evaluation settings.

### Specific Problem Summary:

1. **Excessively High Computational and Memory Resource Requirements**: High-dimensional point cloud data in outdoor scenes impose a heavy computational and memory burden on transformers.
2. **Interference from Background Noise**: A large number of irrelevant background points and empty regions make cross-modal fusion difficult and reduce the accuracy of the model.
3. **Ambiguous Object Recognition**: Some non-target objects may have properties similar to the target object, or may be mentioned in the description, which makes the target ambiguous to recognize.

By proposing LidaRefer and its ambiguous object localization method, this paper aims to overcome these challenges and thereby improve performance on outdoor 3D visual grounding.
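
As a rough illustration of which objects such supervision might cover, the following Python sketch picks, for a given target, the non-target objects that share its category or whose category is named in the description. This is a toy built on assumptions, not the authors' implementation: `SceneObject`, `select_supervised_objects`, and the plain word-matching rule are hypothetical names and simplifications introduced here, and the paper may define ambiguous objects differently.

```python
from dataclasses import dataclass


@dataclass
class SceneObject:
    # Hypothetical minimal record for an annotated object in a scene.
    object_id: int
    category: str  # e.g. "car", "truck", "pedestrian"


def select_supervised_objects(objects, target_id, description):
    """Return (target, ambiguous_objects) for one scene and description.

    Assumed rule: a non-target object counts as ambiguous if it has the
    same category as the target, or if its category appears as a word
    in the description.
    """
    target = next(o for o in objects if o.object_id == target_id)
    # Crude tokenization: lowercase, drop commas and periods, split on spaces.
    cleaned = description.lower().replace(",", " ").replace(".", " ")
    words = set(cleaned.split())
    ambiguous = [
        o for o in objects
        if o.object_id != target_id
        and (o.category == target.category or o.category in words)
    ]
    return target, ambiguous


# Usage example with made-up data.
objects = [
    SceneObject(0, "car"),
    SceneObject(1, "car"),
    SceneObject(2, "truck"),
    SceneObject(3, "pedestrian"),
]
description = "Park behind the car next to the truck."
target, ambiguous = select_supervised_objects(objects, 0, description)
ambiguous_ids = [o.object_id for o in ambiguous]
print(target.object_id, ambiguous_ids)
```

In this example the second car is selected because it shares the target's category, and the truck because the description mentions it; the pedestrian is left out. During training, a model in this spirit would be supervised to localize the selected objects in addition to the target.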