Abstract: In recent years, Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence. However, recent studies have identified that the information loss in the CLIP encoding process is substantial, and CLIP tends to capture only coarse-grained features from the input. This deficiency significantly limits the ability of a single CLIP model to handle images rich in visual detail. In this work, we propose a simple yet effective model-agnostic strategy, Diversified Multiplet Upcycling (DMU), for CLIP. DMU efficiently fine-tunes a series of CLIP models that capture different feature spaces, from a dense pre-trained CLIP checkpoint, sharing parameters except for the Feed-Forward Network (FFN). These models can then be transformed into a CLIP-MoE with a larger model capacity, leading to significantly enhanced performance with minimal computational overhead. To the best of our knowledge, Diversified Multiplet Upcycling is the first approach to introduce sparsely activated MoE into CLIP foundation models. Extensive experiments demonstrate the significant performance of CLIP-MoE across various zero-shot retrieval, zero-shot image classification tasks, and downstream Multimodal Large Language Model (MLLM) benchmarks by serving as a vision encoder. Furthermore, Diversified Multiplet Upcycling enables the conversion of any dense CLIP model into CLIP-MoEs, which can seamlessly replace CLIP in a plug-and-play manner without requiring further adaptation in downstream frameworks. Through Diversified Multiplet Upcycling, we aim to provide valuable insights for future research on developing more efficient and effective multimodal learning systems.

What problem does this paper attempt to address?

This paper addresses the information loss that occurs in the encoding process of existing CLIP models. Specifically, CLIP models tend to encode inputs coarsely and discard much useful fine-grained information. This defect significantly limits the ability of a single CLIP model to process images rich in visual detail, which degrades performance on downstream tasks. The effect is especially pronounced when CLIP serves as the visual encoder of a multimodal large language model (MLLM), where the lost information further limits the model's performance.

To overcome this problem, the authors propose a new method, **Diversified Multiplet Upcycling (DMU)**, which captures diverse and complementary information by integrating multiple CLIP models into a Mixture of Experts (MoE) architecture. The method increases model capacity while keeping computational cost low, and achieves significant improvements on zero-shot retrieval, zero-shot image classification, and MLLM benchmarks in which it serves as the visual encoder.

### Main contributions:

1. **Propose the Diversified Multiplet Upcycling (DMU) method**: Generate multiple CLIP models through multi-stage contrastive learning (MCL), each capturing different information, and integrate them into an MoE architecture, providing an effective way to scale up the CLIP base model.
2. **Fully utilize high-quality data and pre-trained models**: DMU builds on high-quality data and existing pre-trained CLIP checkpoints rather than training a new model from scratch, avoiding the high computational cost that would entail.
3. **Extensive experimental verification**: A large number of experiments show that CLIP-MoE significantly outperforms the original CLIP and other baseline methods on various downstream tasks, at a lower computational cost.

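
The contrastive objective that the multi-stage contrastive learning in contribution 1 builds on can be illustrated with a minimal sketch. This is the standard symmetric InfoNCE loss used for CLIP-style training, not the paper's full MCL procedure: the clustering between stages is not shown, and the function names and the `temperature` value of 0.07 are assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image/text pairs.

    Pair i (image_embs[i], text_embs[i]) is the positive; every other
    combination in the batch acts as a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: temperature-scaled cosine similarity of image i and text j
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Image-to-text direction: row i should concentrate on column i.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: column j should concentrate on row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

Calling `clip_contrastive_loss` on a batch whose image and text embeddings line up pair by pair gives a loss near zero, while permuting the texts raises it sharply. Fine-tuning under this kind of objective is what pulls matched pairs together in the shared embedding space.
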
### Key technologies of the solution:

- **Multi-stage contrastive learning (MCL)**: Generate multiple CLIP models that capture different information through a multi-stage clustering and contrastive learning process.
- **Mixture of Experts (MoE) architecture**: Use the CLIP models generated by MCL to construct a sparsely activated MoE model in which each expert is responsible for capturing different aspects of the input.
- **Continued fine-tuning of the routing network**: Optimize the routing network with a contrastive learning loss and a routing balance loss to ensure that all experts are used effectively.

Through these technologies, CLIP-MoE captures richer information and significantly improves model performance while maintaining a low computational cost.
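
The sparse routing and the routing balance loss described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary does not give the number of active experts or the exact form of the balance loss, so the top-`k` selection with renormalized weights and the Switch-Transformer-style auxiliary loss used here are assumptions, as are all function names.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Pick the k most probable experts for one token.

    Returns (expert_indices, renormalized_weights, full_probs).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    weights = [probs[i] / mass for i in top]
    return top, weights, probs

def moe_ffn(token, router_logits, experts, k=2):
    """Sparsely activated FFN: only the k selected experts are evaluated."""
    top, weights, _ = route_top_k(router_logits, k)
    out = [0.0] * len(token)
    for idx, w in zip(top, weights):
        expert_out = experts[idx](token)  # each expert stands in for one FFN
        out = [o + w * e for o, e in zip(out, expert_out)]
    return out

def load_balance_loss(batch_router_logits, k=2):
    """Assumed auxiliary loss: n_experts * sum_i f_i * P_i.

    f_i is the fraction of routing slots assigned to expert i and P_i is
    the mean router probability for expert i. The value is 1.0 when the
    load is perfectly uniform and grows as routing collapses.
    """
    n_experts = len(batch_router_logits[0])
    n_tokens = len(batch_router_logits)
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in batch_router_logits:
        top, _, probs = route_top_k(logits, k)
        for i in top:
            counts[i] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    f = [c / (n_tokens * k) for c in counts]
    p_mean = [s / n_tokens for s in prob_sums]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p_mean))
```

In this sketch each entry of `experts` is a plain callable standing in for one FFN taken from a fine-tuned CLIP model, with all other parameters shared as the abstract describes. Because only `k` experts run per token, total capacity grows with the number of experts while per-token compute stays close to that of a dense model.
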

CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts

MoDE: CLIP Data Experts via Clustering

Geodesic Multi-Modal Mixup for Robust Fine-Tuning

Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast

Residual Mixture of Experts

LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

CLIPPO: Image-and-Language Understanding from Pixels Only

CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization

HMoE: Heterogeneous Mixture of Experts for Language Modeling

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

EfficientCLIP: Efficient Cross-Modal Pre-training by Ensemble Confident Learning and Language Modeling

HyperMoE: Towards Better Mixture of Experts via Transferring Among Experts

Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models