FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion

Xinhao Xiang, Jiawei Zhang
2023-11-07
Abstract: For 3D object detection, both camera and lidar have been demonstrated to be useful sensory devices for providing complementary information about the same scenery with data representations in different modalities, e.g., 2D RGB image vs 3D point cloud. An effective representation learning and fusion of such multi-modal sensor data is necessary and critical for better 3D object detection performance. To solve the problem, in this paper, we will introduce a novel vision transformer-based 3D object detection model, namely FusionViT. Different from the existing 3D object detection approaches, FusionViT is a pure-ViT based framework, which adopts a hierarchical architecture by extending the transformer model to embed both images and point clouds for effective representation learning. Such multi-modal data embedding representations will be further fused together via a fusion vision transformer model prior to feeding the learned features to the object detection head for both detection and localization of the 3D objects in the input scenery. To demonstrate the effectiveness of FusionViT, extensive experiments have been done on real-world traffic object detection benchmark datasets KITTI and Waymo Open. Notably, our FusionViT model can achieve state-of-the-art performance and outperforms not only the existing baseline methods that merely rely on camera images or lidar point clouds, but also the latest multi-modal image-point cloud deep fusion approaches.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper aims to address the issue of 3D object detection in autonomous driving. Specifically, it explores how to effectively fuse multimodal information from camera images and LiDAR point cloud data to improve the performance of 3D object detection.

**The main contributions are as follows:**

1. **Proposed a 3D object detection framework based on a pure Vision Transformer (ViT)**: The model, named FusionViT, can effectively fuse image and point cloud data to achieve higher detection accuracy.
2. **Designed a hierarchical structure**: By segmenting images into mini-patches and point clouds into mini-cubics, CameraViT and LidarViT are introduced to learn the embedded representations of the input data.
3. **Proposed the MixViT component**: Used to fuse the representations learned by CameraViT and LidarViT, further conducting multi-level representation learning.
4. **Conducted extensive experiments on the Waymo Open Dataset and KITTI benchmark datasets**: FusionViT outperformed various existing methods on these datasets, demonstrating the potential of the pure ViT framework in future 3D object detection tasks.

Through the above methods, the paper addresses the issue of single-modal data being insufficient to fully describe complex scenes and demonstrates robustness and efficiency in traffic scenarios.
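
The partitioning step in the second contribution (images into mini-patches, point clouds into mini-cubics) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `split_into_patches` and `group_into_cubes`, the patch size, the cube side length, and the toy inputs are all assumptions chosen for illustration. The summary does not specify how FusionViT embeds or fuses the resulting pieces, so the sketch stops at the partitioning itself.

```python
from collections import defaultdict


def split_into_patches(image, patch):
    """Split an H x W image (a list of rows) into non-overlapping
    patch x patch tiles, returned in row-major order.

    Assumption: H and W are exact multiples of `patch`.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [row[left:left + patch] for row in image[top:top + patch]]
            patches.append(tile)
    return patches


def group_into_cubes(points, cube):
    """Group (x, y, z) points into axis-aligned cubes of side `cube`.

    Returns a dict mapping integer cube indices (i, j, k) to the list
    of points that fall inside that cube. Empty cubes are not stored.
    """
    cubes = defaultdict(list)
    for x, y, z in points:
        # Floor division keeps negative coordinates in the correct cube.
        key = (int(x // cube), int(y // cube), int(z // cube))
        cubes[key].append((x, y, z))
    return dict(cubes)


# Toy inputs (hypothetical): a 4 x 4 single-channel image and four LiDAR points.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)

points = [(0.2, 0.1, 0.0), (0.9, 0.4, 0.3), (1.5, 0.2, 0.1), (-0.5, 0.0, 0.0)]
cubes = group_into_cubes(points, 1.0)
```

In a ViT-style pipeline, each patch and each non-empty cube would then be flattened and projected into a token before entering its modality-specific transformer; that projection is omitted here because the summary gives no details about it.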