Talk to Parallel LiDARs: A Human-LiDAR Interaction Method Based on 3D Visual Grounding

Yuhang Liu, Boyi Sun, Guixu Zheng, Yishuo Wang, Jing Wang, Fei-Yue Wang
2024-05-24
Abstract: LiDAR sensors play a crucial role in various applications, especially in autonomous driving. Current research primarily focuses on optimizing perceptual models with point cloud data as input, while the exploration of deeper cognitive intelligence remains relatively limited. To address this challenge, parallel LiDARs have emerged as a novel theoretical framework for the next-generation intelligent LiDAR systems, which tightly integrate physical, digital, and social systems. To endow LiDAR systems with cognitive capabilities, we introduce the 3D visual grounding task into parallel LiDARs and present a novel human-computer interaction paradigm for LiDAR systems. We propose Talk2LiDAR, a large-scale benchmark dataset tailored for 3D visual grounding in autonomous driving. Additionally, we present a two-stage baseline approach and an efficient one-stage method named BEVGrounding, which significantly improves grounding accuracy by fusing coarse-grained sentence and fine-grained word embeddings with visual features. Our experiments on Talk2Car-3D and Talk2LiDAR datasets demonstrate the superior performance of BEVGrounding, laying a foundation for further research in this domain.
Computer Vision and Pattern Recognition, Human-Computer Interaction
What problem does this paper attempt to address?
The paper addresses how to endow LiDAR systems with cognitive abilities through the 3D visual grounding task in autonomous driving scenarios, that is, how to accurately locate a target object in a 3D scene from a textual description. Current research mainly focuses on optimizing perception models, while exploration of deeper cognitive intelligence remains relatively limited. To tackle this, the paper builds on parallel LiDARs, a theoretical framework for next-generation intelligent LiDAR systems that tightly integrates physical, digital, and social systems.

The main contributions are:

1. **Introducing the 3D visual grounding task into parallel LiDARs**, endowing LiDAR systems with cognitive abilities through human-computer interaction.
2. **Proposing Talk2LiDAR, a large-scale benchmark dataset** designed for 3D visual grounding in autonomous driving.
3. **Developing a two-stage baseline method and an efficient one-stage method, BEVGrounding**. The latter significantly improves grounding accuracy by fusing coarse-grained sentence embeddings and fine-grained word embeddings with visual features.

Together, these contributions aim to advance research on 3D visual grounding in autonomous driving, in both dataset creation and method development, and lay a foundation for future work.
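The idea of fusing coarse-grained sentence embeddings and fine-grained word embeddings with visual features can be illustrated with a toy sketch. This is not the BEVGrounding architecture, whose details are not given here. It assumes, purely for illustration, that each bird's-eye-view (BEV) cell attends over the word embeddings (fine-grained), that the sentence embedding acts as a per-channel sigmoid gate (coarse-grained), and that the cell best aligned with the sentence is taken as the grounded location. All function names, the toy vectors, and the scoring rule are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_cell(cell, words, sentence):
    """Fuse one BEV cell feature with word-level and sentence-level text features.

    Assumed scheme (not taken from the paper): cross-attention over words,
    then a sigmoid gate derived from the sentence embedding.
    """
    d = len(cell)
    # Fine-grained: attention weights of this cell over every word embedding.
    weights = softmax([dot(cell, w) / math.sqrt(d) for w in words])
    context = [sum(wt * w[i] for wt, w in zip(weights, words)) for i in range(d)]
    # Coarse-grained: the sentence embedding gates each feature channel.
    gate = [1.0 / (1.0 + math.exp(-s)) for s in sentence]
    return [g * (c + ctx) for g, c, ctx in zip(gate, cell, context)]

def ground(bev_cells, words, sentence):
    """Return the index of the best-matching cell and all fused features."""
    fused = [fuse_cell(c, words, sentence) for c in bev_cells]
    scores = [dot(f, sentence) for f in fused]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, fused

# Toy inputs: three flattened BEV cells, two word embeddings, one sentence embedding.
bev_cells = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
words = [[0.0, 2.0], [0.0, 1.0]]
sentence = [0.0, 1.5]

best, fused = ground(bev_cells, words, sentence)
print("grounded cell index:", best)
```

In this toy setup the text features point along the second channel, so the cell whose visual feature lies along that channel scores highest. A real one-stage model would learn the projections and predict a 3D box rather than pick a cell index.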