Abstract: Vision and language foundation models (VLMs) have showcased impressive capabilities in 2D scene understanding. However, their potential for improving the understanding of 3D autonomous driving scenes remains untapped. In this paper, we propose VLM2Scene, which exploits VLMs to enhance 3D self-supervised representation learning through our proposed image-text-LiDAR contrastive learning strategy. In autonomous driving scenes, the inherent sparsity of LiDAR point clouds poses a notable challenge for point-level contrastive learning methods, which are limited by a restricted receptive field and by the presence of noisy points. To tackle this challenge, our approach emphasizes region-level learning, leveraging semantics-free regional masks derived from the vision foundation model. Learning at the region level exploits contextual information to improve point cloud representations. First, we introduce Region Caption Prompts, which use the language foundation model to generate fine-grained language descriptions for the corresponding regions. These region prompts are then used to form positive and negative text-point pairs in the contrastive loss. Second, we propose a Region Semantic Concordance Regularization, which consists of semantic-filtered region learning and a region semantic assignment strategy. The former filters false negative samples based on semantic distance, and the latter mitigates potential inaccuracies in pixel semantics, thereby enhancing overall semantic consistency. Extensive experiments on representative autonomous driving datasets demonstrate that our self-supervised method significantly outperforms its counterparts. Code is available at https://github.com/gbliao/VLM2Scene.
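
To make the region-level text-point contrast and the filtering of false negatives more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that point features have already been projected into the same embedding space as the caption embeddings, that each point carries a region index taken from a mask, and that semantic distance is approximated by cosine similarity between caption embeddings. The function names, the temperature, and the similarity threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def pool_points_by_region(point_feats, region_ids, num_regions):
    """Average point features inside each region (hypothetical helper).

    point_feats: (N, D) per-point features
    region_ids:  (N,) integer region index per point
    """
    sums = np.zeros((num_regions, point_feats.shape[1]), dtype=float)
    np.add.at(sums, region_ids, point_feats)  # unbuffered scatter-add per region
    counts = np.bincount(region_ids, minlength=num_regions)
    return sums / np.maximum(counts, 1)[:, None]


def region_text_contrastive_loss(region_feats, text_feats,
                                 temperature=0.07, sim_threshold=0.9):
    """InfoNCE-style loss between region features and region captions.

    Row i of region_feats and row i of text_feats form the positive pair.
    A caption j != i whose embedding is very similar to caption i is
    treated as a likely false negative and removed from the denominator
    (an assumed stand-in for filtering by semantic distance).
    """
    p = l2_normalize(region_feats)
    t = l2_normalize(text_feats)
    logits = (p @ t.T) / temperature

    text_sim = t @ t.T
    false_neg = text_sim > sim_threshold
    np.fill_diagonal(false_neg, False)  # never mask the positive pair
    logits = np.where(false_neg, -np.inf, logits)

    # Numerically stable log-softmax over each row.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    return float(-(np.diag(logits) - log_denom).mean())
```

In this sketch, setting `sim_threshold` above 1 disables the filtering, which gives a simple way to compare the loss with and without the removal of suspected false negatives.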

Query3D: LLM-Powered Open-Vocabulary Scene Segmentation with Language Embedded 3D Gaussian

VLM2Scene: Self-Supervised Image-Text-LiDAR Learning with Foundation Models for Autonomous Driving Scene Understanding

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image

Open 3D World in Autonomous Driving

Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving

SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving

3D Vision-Language Gaussian Splatting

SLGaussian: Fast Language Gaussian Splatting in Sparse Views

VLP: Vision Language Planning for Autonomous Driving

3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding

Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving

DriveLM: Driving with Graph Visual Question Answering

Embodied Understanding of Driving Scenarios

DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding

OccLLaMA: An Occupancy-Language-Action Generative World Model for Autonomous Driving

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences

Semantic Gaussians: Open-Vocabulary Scene Understanding with 3D Gaussian Splatting

Language-Image Models with 3D Understanding