Abstract: 3D visual grounding (VG) aims to locate relevant objects or regions within 3D scenes based on natural language descriptions. Although recent methods for indoor 3D VG have successfully used transformer-based architectures to capture global contextual information and enable fine-grained cross-modal fusion, they are unsuitable for outdoor environments due to differences in the distribution of point clouds between indoor and outdoor settings. Specifically, first, extensive LiDAR point clouds demand unacceptable computational and memory resources within transformers due to the high-dimensional visual features. Second, dominant background points and empty spaces in sparse LiDAR point clouds complicate cross-modal fusion owing to their irrelevant visual information. To address these challenges, we propose LidaRefer, a transformer-based 3D VG framework designed for large-scale outdoor scenes. Moreover, during training, we introduce a simple and effective localization method, which supervises the decoder's queries to localize not only a target object but also ambiguous objects that might be confused with the target because they exhibit similar attributes in a scene or because the language description is misunderstood. This supervision enhances the model's ability to distinguish ambiguous objects from a target by learning the differences in their spatial relationships and attributes. LidaRefer achieves state-of-the-art performance on Talk2Car-3D, a 3D VG dataset for autonomous driving, with significant improvements under various evaluation settings.

What problem does this paper attempt to address?

The problem this paper attempts to solve is accurately locating target objects in 3D point clouds based on natural language descriptions in large-scale outdoor scenes (3D visual grounding, 3D VG). Although existing indoor 3D VG methods have successfully used Transformer-based architectures to capture global context information and achieve fine-grained cross-modal fusion, these methods are not suitable for outdoor environments. Specifically, LiDAR point clouds in outdoor scenes present the following challenges:

1. **High Dimensionality and Sparsity**: A large amount of LiDAR point cloud data requires unacceptable computational and memory resources, especially when dealing with high-dimensional visual features.
2. **Background Noise and Voids**: The dominant background points and void regions in sparse point clouds contain irrelevant visual information, which makes cross-modal fusion more complex.

To solve these problems, the authors propose LidaRefer, a Transformer-based 3D VG framework specifically designed for large-scale outdoor scenes. In addition, the authors introduce a simple and effective supervision method, namely **Ambiguous Object Localization**, to enhance the model's ability to distinguish between ambiguous objects and target objects. In this way, LidaRefer achieves state-of-the-art performance on the Talk2Car-3D dataset under various evaluation settings.

### Specific Problem Summary:

1. **Excessively High Computational and Memory Resource Requirements**: High-dimensional point cloud data in outdoor scenes impose a huge computational and memory burden on Transformers.
2. **Interference from Background Noise**: A large number of irrelevant background points and void regions make cross-modal fusion difficult and affect the accuracy of the model.
3. **Ambiguous Object Recognition**: Some non-target objects may have similar properties to the target objects or be mentioned in the description, resulting in ambiguity in recognition.

By proposing LidaRefer and its ambiguous object localization method, this paper aims to overcome the above challenges and thus improve the performance of outdoor 3D visual grounding tasks.
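To make the first challenge concrete, the following back-of-the-envelope sketch estimates the memory needed for the attention score matrices of a single standard self-attention layer, which grows quadratically with the number of visual tokens. It is not taken from the paper: the function name, the token counts, the head count, and the float32 assumption are all hypothetical values chosen for illustration.

```python
def attention_matrix_bytes(num_tokens: int, num_heads: int = 8,
                           bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the attention score matrices of one
    self-attention layer for one sample: one N x N matrix per head.

    num_heads=8 and bytes_per_value=4 (float32) are assumed defaults.
    """
    return num_heads * num_tokens * num_tokens * bytes_per_value


# Hypothetical token counts, not figures reported by the paper.
scenes = {
    "indoor-scale scene": 1_024,
    "outdoor-scale LiDAR scene": 32_768,
}

for name, n in scenes.items():
    mib = attention_matrix_bytes(n) / 2**20
    print(f"{name}: {n} tokens -> {mib:,.0f} MiB of attention scores")
```

Under these assumptions, a 32-fold increase in token count raises the attention memory by a factor of 1,024, which illustrates why dense global attention over every visual token becomes impractical for large outdoor scenes.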

LidaRefer: Outdoor 3D Visual Grounding for Autonomous Driving with Transformers

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving

A Transformer-based Real-time LiDAR Semantic Segmentation Method for Restricted Mobile Devices

Open 3D World in Autonomous Driving

WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language

Transformer-based Visual Grounding with Cross-modality Interaction

Talk to Parallel LiDARs: A Human-LiDAR Interaction Method Based on 3D Visual Grounding

SGV3D: Towards Scenario Generalization for Vision-based Roadside 3D Object Detection

Towards Scenario Generalization for Vision-based Roadside 3D Object Detection

Visual Point Cloud Forecasting enables Scalable Autonomous Driving

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

Li3DeTr: A LiDAR based 3D Detection Transformer

NDT-Transformer: Large-Scale 3D Point Cloud Localisation using the Normal Distribution Transform Representation

FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion

Vision-Language Guidance for LiDAR-based Unsupervised 3D Object Detection

Reimagining 3D Visual Grounding: Instance Segmentation and Transformers for Fragmented Point Cloud Scenarios.

Multi-View Transformer for 3D Visual Grounding

Dense Object Grounding in 3D Scenes