Abstract: Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, it is commonplace to deploy multiple sensors for comprehensive environment perception. Although integrating multi-modal features from these sensors can produce rich and powerful features, MAE methods struggle with this integration because of the substantial disparity between the modalities. This research studies multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving, aiming at a more efficient fusion of two distinct modalities. To combine the semantics inherent in images with the geometric detail of LiDAR point clouds, we propose UniM$^2$AE, a potent yet straightforward multi-modal self-supervised pre-training framework built on two designs. First, it projects the features from both modalities into a cohesive 3D volume space, extending the bird's eye view (BEV) with the height dimension. This extension allows for a precise representation of objects and reduces information loss when aligning multi-modal features. Second, the Multi-modal 3D Interactive Module (MMIM) is invoked to facilitate efficient inter-modal interaction. Extensive experiments on the nuScenes dataset attest to the efficacy of UniM$^2$AE, indicating improvements in 3D object detection and BEV map segmentation of 1.2\% NDS and 6.5\% mIoU, respectively. The code is available at https://github.com/hollow-503/UniM2AE.
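The benefit of keeping the height dimension can be illustrated with a toy occupancy grid. The sketch below is not the authors' implementation: the function names `voxelize` and `to_bev`, the grid extents, and the voxel size are all assumptions chosen for illustration. It only shows that two points stacked at different heights stay distinct in a 3D volume but merge into a single cell once the volume is collapsed to BEV.

```python
import numpy as np


def voxelize(points, grid_min, voxel_size, grid_shape):
    """Scatter (N, 3) xyz points into a boolean occupancy volume indexed (Z, Y, X).

    Hypothetical helper for illustration; grid_shape is given as (Z, Y, X).
    """
    points = np.asarray(points, dtype=np.float64)
    # Integer voxel index per point, ordered (x, y, z).
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    shape_xyz = np.array(grid_shape[::-1])
    # Drop points that fall outside the grid.
    inside = np.all((idx >= 0) & (idx < shape_xyz), axis=1)
    idx = idx[inside]
    volume = np.zeros(grid_shape, dtype=bool)
    volume[idx[:, 2], idx[:, 1], idx[:, 0]] = True
    return volume


def to_bev(volume):
    """Collapse the height axis: a BEV cell is occupied if any voxel above it is."""
    return volume.any(axis=0)


# Two points over the same ground location but at different heights (assumed data).
points = np.array([[1.2, 3.4, 0.2],
                   [1.3, 3.1, 2.6]])
grid_min = np.array([0.0, 0.0, 0.0])
voxel_size = np.array([1.0, 1.0, 1.0])

volume = voxelize(points, grid_min, voxel_size, grid_shape=(4, 8, 8))
bev = to_bev(volume)

print("occupied voxels in 3D volume:", int(volume.sum()))
print("occupied cells in BEV:", int(bev.sum()))
```

In this toy setting the volume keeps both points as separate voxels, while the BEV projection cannot tell them apart, which is the kind of information loss the unified 3D volume space is meant to reduce.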

Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training

Mapping medical image-text to a joint space via masked modeling

Masked Autoencoders for Point Cloud Self-supervised Learning

Self Pre-training with Masked Autoencoders for Medical Image Classification and Segmentation

CrossMAE: Cross Modality Masked Autoencoders for Region-Aware Audio-Visual Pretraining

Multimodal Masked Autoencoders Learn Transferable Representations

MedFLIP: Medical Vision-and-Language Self-supervised Fast Pre-Training with Masked Autoencoder

MultiMAE: Multi-modal Multi-task Masked Autoencoders

M3AE: Multimodal Representation Learning for Brain Tumor Segmentation with Missing Modalities

Medical supervised masked autoencoders: Crafting a better masking strategy and efficient fine-tuning schedule for medical image classification

Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering

Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training

Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation

GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds

UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Inter-Modal Masked Autoencoder for Self-Supervised Learning on Point Clouds

MiM: Mask in Mask Self-Supervised Pre-Training for 3D Medical Image Analysis

Self-supervised vision-language pretraining for Medical visual question answering

Advancing Volumetric Medical Image Segmentation via Global-Local Masked Autoencoder

PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection