Abstract: Perception systems in autonomous vehicles need to accurately detect and classify objects in their surrounding environment. These vehicles carry numerous types of sensors, and combining their multimodal data streams can significantly boost performance. The authors introduce a novel sensor fusion framework based on deep convolutional neural networks that uses both camera and LiDAR sensors in a multimodal, multiview configuration. To leverage both data streams, the framework incorporates two fusion mechanisms: element‐wise multiplication and multimodal factorised bilinear pooling. Compared with traditional fusion operators such as element‐wise addition and feature map concatenation, these methods improve the bird's eye view moderate average precision score on the KITTI dataset by +4.97% and +8.35%. The authors also provide an in‐depth analysis of key design choices that affect performance, such as data augmentation, multi‐task learning, and convolutional architecture design. The study aims to pave the way for more robust multimodal machine vision systems. The paper concludes with qualitative results, discussing both successful and problematic cases, along with potential ways to mitigate the latter.
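
The two fusion mechanisms named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it operates on flat feature vectors rather than convolutional feature maps, and the function names, the projection matrices `U` and `V`, the factor count `k`, and the power and L2 normalisation steps are assumptions taken from the commonly used formulation of multimodal factorised bilinear pooling.

```python
import numpy as np

def elementwise_multiply(cam_feat, lidar_feat):
    """Fuse two same-shaped feature arrays by element-wise multiplication."""
    return cam_feat * lidar_feat

def mfb_pool(cam_feat, lidar_feat, U, V, k):
    """Multimodal factorised bilinear pooling on two feature vectors (sketch).

    cam_feat:   (d_c,) camera feature vector
    lidar_feat: (d_l,) LiDAR feature vector
    U: (d_c, o * k) projection for the camera branch (hypothetical name)
    V: (d_l, o * k) projection for the LiDAR branch (hypothetical name)
    k: number of factors summed per output dimension
    Returns a fused vector of shape (o,).
    """
    # Project both modalities into a shared (o * k)-dimensional space
    # and combine them with an element-wise product.
    joint = (cam_feat @ U) * (lidar_feat @ V)
    # Sum-pool over non-overlapping windows of size k.
    pooled = joint.reshape(-1, k).sum(axis=1)
    # Signed square-root (power) normalisation, then L2 normalisation.
    pooled = np.sign(pooled) * np.sqrt(np.abs(pooled))
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

# Toy usage with random features; all dimensions are arbitrary.
rng = np.random.default_rng(0)
cam = rng.standard_normal(8)
lidar = rng.standard_normal(6)
U = rng.standard_normal((8, 4 * 3))
V = rng.standard_normal((6, 4 * 3))
fused = mfb_pool(cam, lidar, U, V, k=3)
```

Unlike element-wise multiplication, the bilinear variant does not require the two feature vectors to share a dimension, because each modality is first projected into a common space; in a trained network `U` and `V` would be learned rather than sampled.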

V2VFusion: Multimodal Fusion for Enhanced Vehicle-to-Vehicle Cooperative Perception

Collaborative Perception Method Based on Multisensor Fusion

ViT-FuseNet: MultiModal Fusion of Vision Transformer for Vehicle-Infrastructure Cooperative Perception

V2VFormer++: Multi-Modal Vehicle-to-Vehicle Cooperative Perception Via Global-Local Transformer

Cooperative Perception with Learning-Based V2V communications

Occlusion-Guided Multi-Modal Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

Multimedia Fusion at Semantic Level in Vehicle Cooperactive Perception

Fusing Onboard Modalities with V2V Information for Autonomous Driving

Radar and Camera Fusion for Multi-Task Sensing in Autonomous Driving

CoBEVFusion: Cooperative Perception with LiDAR-Camera Bird's-Eye View Fusion

Learnable fusion mechanisms for multimodal object detection in autonomous vehicles

Enhancing 3D object detection through multi-modal fusion for cooperative perception

Multi-modal Sensor Fusion for Auto Driving Perception: A Survey

MSFusion: Multilayer Sensor Fusion-Based Robust Motion Estimation

Multi-Modal and Multi-Scale Fusion 3D Object Detection of 4D Radar and LiDAR for Autonomous Driving

Collaborative Multimodal Fusion Network for Multiagent Perception

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

V2I-BEVF: Multi-modal Fusion Based on BEV Representation for Vehicle-Infrastructure Perception

A Novel Probabilistic V2X Data Fusion Framework for Cooperative Perception