Abstract: We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model. 2) Sparsification with MoE on a single modality-agnostic encoder substantially improves the performance, outperforming dense models that use modality-specific encoders or additional fusion layers, and greatly mitigates the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks including video classification, image classification, image-text retrieval, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L variant focusing on video tasks that achieves a new state of the art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of their total training computational cost.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address several key challenges in multimodal perception:

1. **Efficient Integration of Multimodal Data**: Different modalities (such as images, videos, text, and audio) have different structures and input-output signatures. Integrating them efficiently within a single model is a challenging problem.
2. **Scalability of Multitask Training**: As the number of tasks and datasets grows, the training framework must be able to integrate new tasks and datasets seamlessly without increasing memory and computational overhead.
3. **Optimization Conflicts Between Multimodal Tasks**: Tasks from different modalities may conflict during optimization. Alleviating these conflicts through model design and training strategy, so that overall performance improves, is a further challenge.

To address these issues, the paper proposes the **Integrated Multimodal Perception (IMP)** model, which combines **Alternating Gradient Descent (AGD)** and **Mixture-of-Experts (MoE)** to achieve efficient and scalable multimodal multitask training. Specifically, IMP addresses the above problems in the following ways:

- **Alternating Gradient Descent (AGD)**: Gradient descent updates alternate between different modalities, loss functions, and tasks, which efficiently improves the model.
- **Mixture-of-Experts (MoE)**: Sparsifying a single modality-agnostic encoder with MoE substantially improves performance and mitigates conflicts between modalities.
- **Multiresolution Training**: Batch size or resolution is adjusted to compensate for the additional temporal tokens in video data, keeping computation and memory usage efficient.

Through these methods, IMP achieves competitive performance on multiple downstream tasks, most notably setting a new state of the art in zero-shot video classification.
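
The alternating update described above can be illustrated with a minimal toy sketch. This is not the paper's implementation. It assumes a shared linear model, two hypothetical tasks with different loss functions (`mse_grad`, `huber_grad`) and different batch sizes standing in for different input signatures, and a simple round-robin schedule. All names and hyperparameters are illustrative assumptions. The point is only that each step computes the gradient of one task and applies it to the shared parameters, rather than summing all losses in every step.

```python
import random

def predict(w, x):
    """Shared linear model used by every task: y = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def mse_grad(w, batch):
    """Gradient of the mean squared error over one batch."""
    g = [0.0] * len(w)
    for x, y in batch:
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            g[i] += 2.0 * err * xi / len(batch)
    return g

def huber_grad(w, batch, delta=1.0):
    """Gradient of the mean Huber loss over one batch."""
    g = [0.0] * len(w)
    for x, y in batch:
        err = predict(w, x) - y
        # Quadratic region uses err; linear region clips the slope to +/- delta.
        slope = err if abs(err) <= delta else delta * (1.0 if err > 0 else -1.0)
        for i, xi in enumerate(x):
            g[i] += slope * xi / len(batch)
    return g

def make_batch(rng, w_true, batch_size):
    """Draw a noiseless synthetic batch (toy data, an assumption of this sketch)."""
    batch = []
    for _ in range(batch_size):
        x = [rng.uniform(-1.0, 1.0) for _ in w_true]
        batch.append((x, predict(w_true, x)))
    return batch

def agd_train(tasks, w, steps, lr, rng, w_true):
    """Alternating gradient descent: one task per step, shared weights."""
    for step in range(steps):
        task = tasks[step % len(tasks)]  # round-robin schedule (assumed)
        batch = make_batch(rng, w_true, task["batch_size"])
        grad = task["grad_fn"](w, batch)  # gradient of this task's loss only
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

W_TRUE = [1.5, -0.5]
TASKS = [
    {"name": "task_a", "grad_fn": mse_grad, "batch_size": 32},
    {"name": "task_b", "grad_fn": huber_grad, "batch_size": 8},
]
w_final = agd_train(TASKS, [0.0, 0.0], steps=2000, lr=0.1,
                    rng=random.Random(0), w_true=W_TRUE)
print(w_final)
```

In this sketch, replacing the round-robin line with a weighted random choice over `TASKS` would change only the schedule. The update itself still touches the shared weights with one task's gradient at a time.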

Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

Improving Multimodal Learning with Multi-Loss Gradient Modulation

Balanced Multimodal Learning via On-the-fly Gradient Modulation

Mixture of Nested Experts: Adaptive Processing of Visual Tokens

On-the-fly Modulation for Balanced Multimodal Learning

CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

Improving Unimodal Inference with Multimodal Transformers

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

Everything is a Video: Unifying Modalities through Next-Frame Prediction

Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification

Classifier-guided Gradient Modulation for Enhanced Multimodal Learning

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Multimodal Instruction Tuning with Hybrid State Space Models

Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization

p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning

Geodesic Multi-Modal Mixup for Robust Fine-Tuning

Residual Mixture of Experts

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

MoTE: Reconciling Generalization with Specialization for Visual-Language to Video Knowledge Transfer