LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection

Yihan Zeng, Da Zhang, Chunwei Wang, Zhenwei Miao, Ting Liu, Xin Zhan, Dayang Hao, Chao Ma
DOI: https://doi.org/10.1109/CVPR52688.2022.01666
2022-01-01
Abstract: LiDAR and cameras are two common sensors that collect data over time for 3D object detection in autonomous driving. Although the complementary information across sensors and time has great potential to benefit 3D perception, taking full advantage of sequential cross-sensor data remains challenging. In this paper, we propose a novel LiDAR Image Fusion Transformer (LIFT) to model the mutual interaction of cross-sensor data over time. LIFT learns to align the input 4D sequential cross-sensor data to achieve multi-frame, multi-modal information aggregation. To reduce the computational load, we project both point clouds and images into bird's-eye-view (BEV) maps and compute sparse grid-wise self-attention. LIFT also benefits from a cross-sensor and cross-time data augmentation scheme. We evaluate the proposed approach on the challenging nuScenes and Waymo datasets, where LIFT performs favorably against state-of-the-art methods and strong baselines.
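The idea of projecting data into a bird's-eye-view grid and attending only over occupied cells can be illustrated with a minimal sketch. This is not the authors' implementation: the grid extents, the cell size, the mean pooling, and the function names `bev_cell`, `pool_to_bev`, and `sparse_self_attention` are all assumptions made for illustration, and the attention uses identity query/key/value projections rather than learned ones.

```python
import math

def bev_cell(x, y, x_min=-50.0, y_min=-50.0, cell=1.0):
    """Map a point's (x, y) to integer BEV grid indices (assumed grid extents)."""
    return (int((x - x_min) // cell), int((y - y_min) // cell))

def pool_to_bev(points, feats):
    """Average the features of points that fall into the same BEV cell.

    Only occupied cells are stored, so the result is a sparse grid.
    """
    sums, counts = {}, {}
    for (x, y, _z), f in zip(points, feats):
        key = bev_cell(x, y)
        acc = sums.setdefault(key, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[key] = counts.get(key, 0) + 1
    return {k: [v / counts[k] for v in s] for k, s in sums.items()}

def sparse_self_attention(grid):
    """Scaled dot-product self-attention restricted to occupied cells.

    Assumption: identity Q/K/V projections, single head.
    """
    keys = sorted(grid)
    d = len(grid[keys[0]])
    out = {}
    for q in keys:
        scores = [
            sum(a * b for a, b in zip(grid[q], grid[k])) / math.sqrt(d)
            for k in keys
        ]
        m = max(scores)  # subtract the max for a numerically stable softmax
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out[q] = [
            sum(wi / z * grid[k][j] for wi, k in zip(w, keys))
            for j in range(d)
        ]
    return out

# Toy usage: three points, two of which share a BEV cell.
points = [(0.2, 0.3, 1.0), (0.7, 0.1, 0.5), (10.0, -3.0, 0.0)]
feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
grid = pool_to_bev(points, feats)
attended = sparse_self_attention(grid)
```

Because attention runs over occupied cells only, its cost scales with the number of non-empty grid cells rather than with the full grid size, which is the kind of saving the abstract alludes to.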