FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion

Xinhao Xiang, Jiawei Zhang

2023-11-07

Abstract: For 3D object detection, both camera and lidar have been shown to be useful sensors that provide complementary information about the same scene in different data modalities, e.g., 2D RGB images vs. 3D point clouds. Effective representation learning and fusion of such multi-modal sensor data is critical for better 3D object detection performance. To address this problem, we introduce a novel vision transformer-based 3D object detection model, namely FusionViT. Unlike existing 3D object detection approaches, FusionViT is a pure-ViT based framework that adopts a hierarchical architecture, extending the transformer model to embed both images and point clouds for effective representation learning. These multi-modal embedding representations are then fused via a fusion vision transformer model before the learned features are fed to the object detection head, which both detects and localizes the 3D objects in the input scene. To demonstrate the effectiveness of FusionViT, we conduct extensive experiments on the real-world traffic object detection benchmark datasets KITTI and Waymo Open. Notably, our FusionViT model achieves state-of-the-art performance and outperforms not only existing baseline methods that rely solely on camera images or lidar point clouds, but also the latest multi-modal image-point cloud deep fusion approaches.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address the issue of 3D object detection in autonomous driving. Specifically, it explores how to effectively fuse multimodal information from camera images and LiDAR point cloud data to improve the performance of 3D object detection.

**The main contributions are as follows:**

1. **Proposed a 3D object detection framework based on a pure Vision Transformer (ViT)**: The model, named FusionViT, can effectively fuse image and point cloud data to achieve higher detection accuracy.
2. **Designed a hierarchical structure**: By segmenting images into mini-patches and point clouds into mini-cubics, CameraViT and LidarViT are introduced to learn the embedded representations of the input data.
3. **Proposed the MixViT component**: Used to fuse the representations learned by CameraViT and LidarViT, further conducting multi-level representation learning.
4. **Conducted extensive experiments on the Waymo Open Dataset and KITTI benchmark datasets**: FusionViT outperformed various existing methods on these datasets, demonstrating the potential of the pure ViT framework in future 3D object detection tasks.

Through the above methods, the paper addresses the issue of single-modal data being insufficient to fully describe complex scenes and demonstrates robustness and efficiency in traffic scenarios.
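
To make the partitioning step in the second contribution more concrete, the following is a minimal NumPy sketch of how an image might be split into mini-patches and a point cloud grouped into mini-cubics before being embedded. It is not the authors' implementation: the function names `image_to_patches` and `points_to_cubes`, the patch size, and the cube edge length are all hypothetical choices made for illustration, and the summary above does not specify how FusionViT performs this step internally.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping patch x patch tiles.

    Returns an array of shape (num_patches, patch * patch * C), one flattened
    tile per row, in row-major tile order. The patch size is an assumption.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be divisible by patch")
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)  # (tile_row, tile_col, py, px, c)
    return tiles.reshape(-1, patch * patch * c)


def points_to_cubes(points: np.ndarray, cube: float) -> dict:
    """Group (N, 3) points into axis-aligned cubes with edge length `cube`.

    Returns a dict mapping an integer cube index (ix, iy, iz) to the (M, 3)
    array of points that fall inside that cube. The edge length is an assumption.
    """
    indices = np.floor(points / cube).astype(np.int64).tolist()
    groups: dict = {}
    for key, point in zip(map(tuple, indices), points):
        groups.setdefault(key, []).append(point)
    return {key: np.stack(group) for key, group in groups.items()}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 96, 3))            # hypothetical camera frame
    cloud = rng.uniform(-10, 10, (1000, 3))    # hypothetical LiDAR sweep
    patches = image_to_patches(image, patch=16)
    cubes = points_to_cubes(cloud, cube=5.0)
    print(patches.shape, len(cubes))
```

In a ViT-style pipeline, each flattened patch and each per-cube point set would then be projected to a token embedding; the two token sequences are what a fusion stage such as MixViT would consume.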
