Abstract: Research on general multimodal systems has gained significant attention due to the proliferation of multimodal data in the real world. Despite the remarkable performance achieved by existing multimodal representation learning schemes, missing modalities remain a persistent issue that limits the overall applicability of multimodal systems. To address this issue, we propose a novel approach named M³ixup (Multi-Modal Mixup), which leverages the mixup strategy to improve unimodal and multimodal representation learning while simultaneously increasing robustness against missing modalities. First, we adopt a productive multimodal learning scheme to model representations with modality-specific and joint-modality encoders. This general scheme makes the proposed approach transferable to various multimodal learning scenarios, including supervised, unsupervised, and reinforcement learning. Then, unimodal input mixup and manifold mixup are used to enhance the modality-specific encoders so that they capture intra-modal dynamics. Next, we present multimodal mixup, which mixes different modalities and generates mixed multimodal representations in two steps, adapting and exploring. The adapting step aims to bridge the large information gaps between unimodal and multimodal representations in the joint space during alignment, while the exploring step further captures the inter-modal dynamics and exploits the non-linear relationships among different modalities. After that, the mixed views are aligned with the original multimodal representations by contrastive learning. Additionally, we extend the mixup strategy to the loss function of multimodal contrastive learning in two steps to improve the alignment between mixed and original views. Extensive experiments on public datasets in various multimodal learning scenarios demonstrate the superiority of the proposed M³ixup. The code is available at https://github.com/RH-Lin/m3ixup.
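
The mixing operation that the abstract builds on can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`mixup`, `sample_lambda`, `unimodal_mixup`, `multimodal_mixup`) are hypothetical, the Beta(alpha, alpha) sampling of the mixing coefficient follows the standard mixup recipe rather than anything stated in the abstract, and the two modality representations are assumed to share the same dimension in a joint space.

```python
import random


def mixup(x_a, x_b, lam):
    """Return the convex combination lam * x_a + (1 - lam) * x_b."""
    if len(x_a) != len(x_b):
        raise ValueError("vectors must have the same length")
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def sample_lambda(alpha=1.0, rng=random):
    """Draw a mixing coefficient in [0, 1] from Beta(alpha, alpha).

    Assumption: this is the standard mixup choice, not a detail
    taken from the abstract.
    """
    return rng.betavariate(alpha, alpha)


def unimodal_mixup(batch, lam, perm):
    """Mix each sample with a partner from the same modality.

    `batch` is a list of feature vectors (raw inputs or hidden
    representations); `perm` gives the index of each partner.
    """
    return [mixup(batch[i], batch[perm[i]], lam) for i in range(len(batch))]


def multimodal_mixup(rep_m1, rep_m2, lam):
    """Mix the representations of two modalities of the same sample.

    Assumption: both representations already live in a joint space
    of equal dimension.
    """
    return mixup(rep_m1, rep_m2, lam)
```

For example, `multimodal_mixup([1.0, 0.0], [0.0, 1.0], 0.25)` weights the first modality by 0.25 and the second by 0.75. In a contrastive setup, such a mixed view would then be treated as a positive for the original multimodal representation of the same sample.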

MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding

Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation

Dense Multimodal Alignment for Open-Vocabulary 3D Scene Understanding

MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding

JM3D & JM3D-LLM: Elevating 3D Understanding with Joint Multi-modal Cues

HybridVocab: Towards Multi-Modal Machine Translation Via Multi-Aspect Alignment

Adapt and explore: Multimodal mixup for representation learning

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

MIXER: A Principled Framework for Multimodal, Multiway Data Association

Mix-Teaching: A Simple, Unified and Effective Semi-Supervised Learning Framework for Monocular 3D Object Detection

FULLER: Unified Multi-modality Multi-task 3D Perception Via Multi-level Gradient Calibration

MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking

SMMix: Self-Motivated Image Mixing for Vision Transformers

MolMix: A Simple Yet Effective Baseline for Multimodal Molecular Representation Learning

Unifying Visual and Semantic Feature Spaces with Diffusion Models for Enhanced Cross-Modal Alignment

Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training

Complex Mixer for MedMNIST Classification Decathlon

PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition

PointCMC: cross-modal multi-scale correspondences learning for point cloud understanding

Med3DInsight: Enhancing 3D Medical Image Understanding with 2D Multi-Modal Large Language Models