Abstract: Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, it is commonplace to deploy multiple sensors for comprehensive environment perception. Although integrating multi-modal features from these sensors can produce rich and powerful features, MAE methods struggle to address this integration because of the substantial disparity between the modalities. This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving, aiming for a more efficient fusion of two distinct modalities. To combine the semantics inherent in images with the geometric detail of LiDAR point clouds, we propose UniM$^2$AE, a potent yet straightforward multi-modal self-supervised pre-training framework consisting mainly of two designs. First, it projects the features from both modalities into a cohesive 3D volume space, extending the bird's eye view (BEV) with the height dimension. This extension allows for a precise representation of objects and reduces information loss when aligning multi-modal features. Second, the Multi-modal 3D Interactive Module (MMIM) is invoked to facilitate efficient inter-modal interaction. Extensive experiments conducted on the nuScenes Dataset attest to the efficacy of UniM$^2$AE, indicating enhancements in 3D object detection and BEV map segmentation by 1.2\% NDS and 6.5\% mIoU, respectively. The code is available at https://github.com/hollow-503/UniM2AE.

UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving
