Abstract: Existing top-performing autonomous driving systems typically rely on a multi-modal fusion strategy for reliable scene understanding. This design, however, is fundamentally limited because it overlooks modality-specific strengths, which ultimately hampers model performance. To address this limitation, we introduce a novel modality interaction strategy in which individual per-modality representations are learned and maintained throughout, so that their unique characteristics can be exploited across the whole perception pipeline. To demonstrate the effectiveness of this strategy, we design DeepInteraction++, a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder. Specifically, the encoder is implemented as a dual-stream Transformer with specialized attention operations for information exchange and integration between separate modality-specific representations. Our multi-modal representational learning incorporates both object-centric, precise sampling-based feature alignment and global dense information spreading, essential for the more challenging planning task. The decoder iteratively refines the predictions by alternately aggregating information from the separate representations in a unified, modality-agnostic manner, realizing multi-modal predictive interaction. Extensive experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks. Our code is available at https://github.com/fudan-zvg/DeepInteraction.
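
To make the interaction idea in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the DeepInteraction++ implementation: the function names (`cross_attend`, `encoder_layer`, `decoder`), the token values, the use of unprojected dot-product attention, and the LiDAR-then-camera ordering in the decoder are all assumptions made for illustration, and the sampling-based alignment mentioned above is omitted. The sketch only shows the structural point that the two modality streams exchange information yet remain separate, and that the decoder queries draw on each stream in turn.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention with no learned projections
    (a simplification assumed here; real models use learned Q/K/V maps)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out


def add(a, b):
    """Element-wise residual addition of two token lists of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def encoder_layer(cam, lidar):
    """Hypothetical dual-stream layer: each stream attends to the other,
    but both streams are returned separately rather than fused."""
    new_cam = add(cam, cross_attend(cam, lidar))
    new_lidar = add(lidar, cross_attend(lidar, cam))
    return new_cam, new_lidar


def decoder(queries, cam, lidar, num_layers=2):
    """Hypothetical decoder: queries are refined by aggregating from the
    LiDAR stream and the camera stream alternately (order is assumed)."""
    for _ in range(num_layers):
        queries = add(queries, cross_attend(queries, lidar))
        queries = add(queries, cross_attend(queries, cam))
    return queries


# Toy tokens: 3 camera tokens and 2 LiDAR tokens, each of dimension 2.
cam = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lidar = [[0.5, -0.5], [2.0, 0.0]]

cam2, lidar2 = encoder_layer(cam, lidar)
queries = decoder([[0.0, 0.0]], cam2, lidar2)
```

After the encoder step, `cam2` still holds three tokens and `lidar2` still holds two, which is the contrast with a fusion design that would merge them into a single representation before decoding.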

A Camera-Based End-to-End Autonomous Driving Framework Combined with Meta-Based Multi-task Optimization

Multi-Camera Object Fusion Tracking Model for Autonomous Driving

End-to-End Autonomous Driving With Semantic Depth Cloud Mapping and Multi-Agent

Multi-View Fusion of Sensor Data for Improved Perception and Prediction in Autonomous Driving

A Versatile and Efficient Reinforcement Learning Framework for Autonomous Driving

Multi-Modal Sensor Fusion-Based Deep Neural Network for End-to-End Autonomous Driving With Scene Understanding

Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models

Trajectory-guided Control Prediction for End-to-end Autonomous Driving: A Simple yet Strong Baseline

Multi-Task Deep Learning Model for Autonomous Driving: Object Detection, Semantic Segmentation, and Depth Estimation

ModEL: A Modularized End-to-end Reinforcement Learning Framework for Autonomous Driving

Multimodal End-to-End Autonomous Driving

DeepInteraction++: Multi-Modality Interaction for Autonomous Driving

Multi-Task Learning in Autonomous Driving Scenarios Via Adaptive Feature Refinement Networks

MMFN: Multi-Modal-Fusion-Net for End-to-End Driving

BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

LiDAR-as-Camera for End-to-End Driving

Integrating Modular Pipelines with End-to-End Learning: A Hybrid Approach for Robust and Reliable Autonomous Driving Systems