Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception

Hassan Akbari, Dan Kondratyuk, Yin Cui, Rachel Hornung, Huisheng Wang, Hartwig Adam
2023-12-12
Abstract: We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model. 2) Sparsification with MoE on a single modality-agnostic encoder substantially improves the performance, outperforming dense models that use modality-specific encoders or additional fusion layers and greatly mitigates the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks including video classification, image classification, image-text, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L variant focusing on video tasks that achieves new state-of-the-art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving the previous state-of-the-art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of their total training computational cost.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Multimedia, Image and Video Processing
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address several key challenges in multimodal perception:

1. **Efficient Integration of Multimodal Data**: Different modalities (such as images, videos, text, and audio) have different structures and input-output signatures. Integrating them efficiently within a single model is a challenging problem.
2. **Scalability of Multitask Training**: As the number of tasks and datasets increases, it becomes crucial to design a training framework that can seamlessly integrate new tasks and datasets without increasing memory and computational overhead.
3. **Optimization Conflicts Between Multimodal Tasks**: Tasks of different modalities may conflict during optimization. Alleviating these conflicts through model design and training strategy, so that overall performance improves, is another challenge.

To address these issues, the paper proposes the **Integrated Multimodal Perception (IMP)** model, which combines **Alternating Gradient Descent (AGD)** and **Mixture-of-Experts (MoE)** techniques to achieve efficient and scalable multimodal multitask training. Specifically, the IMP model addresses the above problems in the following ways:

- **Alternating Gradient Descent (AGD)**: Gradient descent updates alternate between different modalities, loss functions, and tasks, which efficiently improves the model.
- **Mixture-of-Experts (MoE)**: Sparsification is applied to a single modality-agnostic encoder, which substantially improves performance and mitigates conflicts between modalities.
- **Multiresolution Training**: Batch size or resolution is adjusted to compensate for the additional time tokens in video data, keeping computation and memory usage efficient.
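The alternating update scheme described above can be illustrated with a minimal sketch. This is not the paper's implementation: the two task names, their quadratic losses, the round-robin schedule, and the learning rate are all hypothetical choices made only to show the core idea that each step updates one set of shared parameters using the gradient of a single task.

```python
from itertools import cycle

# Hypothetical toy tasks: each supplies the gradient of its own loss,
# but both act on the SAME shared parameter vector w.
def image_text_grad(w):
    # Gradient of the toy loss (w[0] - 1)^2 + (w[1] - 1)^2
    return [2.0 * (w[0] - 1.0), 2.0 * (w[1] - 1.0)]

def video_text_grad(w):
    # Gradient of the toy loss (w[0] - 1)^2 + (w[1] - 3)^2
    return [2.0 * (w[0] - 1.0), 2.0 * (w[1] - 3.0)]

TASKS = [("image_text", image_text_grad), ("video_text", video_text_grad)]

def alternating_gradient_descent(tasks, w, lr=0.1, steps=200):
    """Update shared parameters with one task's gradient per step (round-robin)."""
    schedule = cycle(tasks)  # assumption: a simple round-robin task schedule
    for _ in range(steps):
        _name, grad_fn = next(schedule)
        g = grad_fn(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

if __name__ == "__main__":
    print(alternating_gradient_descent(TASKS, [0.0, 0.0]))
```

In this toy setup the tasks agree on the first coordinate, so it converges to their common optimum, while the second coordinate settles between the two task optima. Only one task's loss is evaluated per step, so no step has to hold every task's batch and loss at the same time.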
Through these methods, the IMP model achieves competitive performance on multiple downstream tasks, most notably setting a new state of the art in zero-shot video classification.
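The sparsification idea behind the MoE component can also be sketched in a few lines. The summary does not state the routing rule IMP uses, so this example assumes top-1 gating purely for illustration. The router weights and the two expert functions are hypothetical placeholders, not components of the model.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top1_moe(token, router_weights, experts):
    """Send one token to its highest-scoring expert and scale by the gate value."""
    # One router logit per expert: dot product of that expert's row with the token.
    logits = [sum(w * t for w, t in zip(row, token)) for row in router_weights]
    gates = softmax(logits)
    k = max(range(len(gates)), key=gates.__getitem__)  # index of the chosen expert
    out = experts[k](token)  # only the selected expert is evaluated
    return k, [gates[k] * o for o in out]

# Hypothetical experts and router, for illustration only.
EXPERTS = [
    lambda t: [2.0 * x for x in t],
    lambda t: [x + 1.0 for x in t],
]
ROUTER = [[1.0, 0.0], [0.0, 1.0]]

if __name__ == "__main__":
    print(top1_moe([3.0, 0.0], ROUTER, EXPERTS))
    print(top1_moe([0.0, 3.0], ROUTER, EXPERTS))
```

Because each token runs through a single expert, adding experts increases parameter count without a matching increase in per-token compute. Tokens with different characteristics can also be handled by different experts, which is one intuition for how sparsification might reduce conflicts between modalities.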