CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling

Jihai Zhang, Xiaoye Qu, Tong Zhu, Yu Cheng
2024-10-03
Abstract: In recent years, Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence. However, recent studies have identified that the information loss in the CLIP encoding process is substantial, and CLIP tends to capture only coarse-grained features from the input. This deficiency significantly limits the ability of a single CLIP model to handle images rich in visual detail. In this work, we propose a simple yet effective model-agnostic strategy, Diversified Multiplet Upcycling (DMU), for CLIP. DMU efficiently fine-tunes a series of CLIP models that capture different feature spaces, from a dense pre-trained CLIP checkpoint, sharing parameters except for the Feed-Forward Network (FFN). These models can then be transformed into a CLIP-MoE with a larger model capacity, leading to significantly enhanced performance with minimal computational overhead. To the best of our knowledge, Diversified Multiplet Upcycling is the first approach to introduce sparsely activated MoE into CLIP foundation models. Extensive experiments demonstrate the significant performance of CLIP-MoE across various zero-shot retrieval, zero-shot image classification tasks, and downstream Multimodal Large Language Model (MLLM) benchmarks by serving as a vision encoder. Furthermore, Diversified Multiplet Upcycling enables the conversion of any dense CLIP model into CLIP-MoEs, which can seamlessly replace CLIP in a plug-and-play manner without requiring further adaptation in downstream frameworks. Through Diversified Multiplet Upcycling, we aim to provide valuable insights for future research on developing more efficient and effective multimodal learning systems.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the information loss that occurs in the encoding process of existing CLIP models. Specifically, CLIP models tend to encode inputs coarsely, discarding much useful fine-grained information. This limits the ability of a single CLIP model to handle images rich in visual detail and degrades performance on downstream tasks, particularly when CLIP is used as the vision encoder of a Multimodal Large Language Model (MLLM).

To overcome this problem, the authors propose **Diversified Multiplet Upcycling (DMU)**, which captures diverse and complementary information by integrating multiple CLIP models into a Mixture of Experts (MoE) architecture. The method increases model capacity while keeping computational cost low, and it yields significant improvements on zero-shot retrieval, zero-shot image classification, and MLLM benchmarks in which CLIP-MoE serves as the vision encoder.

### Main contributions:

1. **Propose the Diversified Multiplet Upcycling (DMU) method**: Multi-stage contrastive learning (MCL) generates multiple CLIP models, each capturing different information, and these are integrated into an MoE architecture. This provides an effective way to scale up a CLIP base model.
2. **Fully utilize high-quality data and pre-trained models**: DMU makes full use of high-quality data and existing pre-trained CLIP checkpoints rather than retraining the model, avoiding the high computational cost of training from scratch.
3. **Extensive experimental verification**: Extensive experiments show that CLIP-MoE significantly outperforms the original CLIP and other baseline methods on various downstream tasks, at a lower computational cost.
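
The claim that an MoE raises capacity while keeping compute low can be illustrated with simple parameter arithmetic. The sketch below is not taken from the paper: the layer sizes, the number of experts, the top-k value, and the bias-free linear router are all illustrative assumptions. It only shows that, when the FFN is replicated into several experts but each token is routed to a few of them, the stored parameters grow with the number of experts while the parameters actually used per token grow with top-k.

```python
def ffn_params(d_model: int, d_hidden: int) -> int:
    """Weights and biases of one two-layer FFN (up-projection + down-projection)."""
    up = d_model * d_hidden + d_hidden
    down = d_hidden * d_model + d_model
    return up + down


def moe_ffn_budget(d_model: int, d_hidden: int, num_experts: int, top_k: int) -> dict:
    """Compare one dense FFN with an MoE FFN built from `num_experts` copies.

    Assumption: the router is a single bias-free linear map d_model -> num_experts.
    """
    per_expert = ffn_params(d_model, d_hidden)
    router = d_model * num_experts
    return {
        "dense": per_expert,                             # original FFN
        "moe_total": num_experts * per_expert + router,  # parameters stored
        "moe_active": top_k * per_expert + router,       # parameters used per token
    }


# Hypothetical sizes, chosen only for illustration.
budget = moe_ffn_budget(d_model=1024, d_hidden=4096, num_experts=4, top_k=2)
for name, count in budget.items():
    print(f"{name:>10}: {count:,}")
```

Because DMU shares all parameters except the FFN, only the FFN blocks are multiplied in this way; attention and embedding parameters are stored once.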
### Key technologies of the solution:

- **Multi-stage contrastive learning (MCL)**: Generates multiple CLIP models that capture different information through a multi-stage clustering and contrastive learning process.
- **Mixture of Experts (MoE) architecture**: Uses the CLIP models generated by MCL to construct a sparsely activated MoE model in which each expert captures different aspects of the input.
- **Continued fine-tuning of the routing network**: Optimizes the routing network with a contrastive learning loss and a routing balance loss so that all experts are used effectively.

Together, these techniques allow CLIP-MoE to capture richer information and significantly improve performance while keeping computational cost low.
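
A minimal NumPy sketch of the last two ideas follows: a sparsely activated FFN layer whose experts are initialized from separately fine-tuned FFN weights, plus a routing balance term. Several details are assumptions rather than the paper's specification: the class and function names (`MoEFFN`, `load_balance_loss`, `random_ffn`) are hypothetical, the router is a plain linear layer with softmax and top-k selection, the activation is a tanh-approximated GELU, and the balance term uses the Switch-Transformer-style auxiliary loss as a stand-in because the summary does not give its exact form. The random weights stand in for FFNs that would come from the MCL stages.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class MoEFFN:
    """Top-k routed mixture of FFN experts (hypothetical helper, not the paper's code)."""

    def __init__(self, expert_weights, d_model, top_k=2, seed=0):
        # expert_weights: list of (W1, b1, W2, b2), one tuple per fine-tuned FFN.
        self.experts = expert_weights
        self.top_k = top_k
        rng = np.random.default_rng(seed)
        # The router is the only newly initialized parameter in this sketch.
        self.router = rng.normal(0.0, 0.02, size=(d_model, len(expert_weights)))

    def __call__(self, x):
        # x: (tokens, d_model)
        probs = softmax(x @ self.router)                          # (tokens, E)
        top_idx = np.argsort(-probs, axis=-1)[:, : self.top_k]    # (tokens, k)
        top_p = np.take_along_axis(probs, top_idx, axis=-1)
        gates = top_p / top_p.sum(axis=-1, keepdims=True)         # renormalize over top-k
        out = np.zeros_like(x)
        for e, (W1, b1, W2, b2) in enumerate(self.experts):
            token_ids, slot = np.nonzero(top_idx == e)            # tokens routed to expert e
            if token_ids.size == 0:
                continue
            h = gelu(x[token_ids] @ W1 + b1) @ W2 + b2
            out[token_ids] += gates[token_ids, slot][:, None] * h
        return out, probs, top_idx


def load_balance_loss(probs, top_idx):
    """Switch-style auxiliary loss: E * sum_e (dispatch fraction_e * mean router prob_e)."""
    num_experts = probs.shape[-1]
    counts = np.bincount(top_idx.ravel(), minlength=num_experts)
    frac = counts / top_idx.size
    mean_prob = probs.mean(axis=0)
    return num_experts * float(np.sum(frac * mean_prob))


def random_ffn(rng, d_model, d_hidden):
    # Placeholder for one fine-tuned FFN's weights.
    return (
        rng.normal(0.0, 0.1, size=(d_model, d_hidden)),
        np.zeros(d_hidden),
        rng.normal(0.0, 0.1, size=(d_hidden, d_model)),
        np.zeros(d_model),
    )


rng = np.random.default_rng(0)
d_model, d_hidden, num_experts = 8, 16, 4
experts = [random_ffn(rng, d_model, d_hidden) for _ in range(num_experts)]
layer = MoEFFN(experts, d_model, top_k=2)
x = rng.normal(size=(5, d_model))
out, probs, top_idx = layer(x)
aux = load_balance_loss(probs, top_idx)
print(out.shape, top_idx.shape, round(aux, 4))
```

In training, a term like `aux` would be added with a small weight to the contrastive loss when fine-tuning the router. One property worth noting: because the gates are renormalized to sum to one, an MoE built from identical copies of a single FFN reproduces that dense FFN exactly, so any gain has to come from the experts differing from one another.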