Abstract: Vision-language pre-training like CLIP has shown promising performance on various downstream tasks such as zero-shot image classification and image-text retrieval. Most existing CLIP-like works adopt relatively large image encoders like ResNet50 and ViT, while their lightweight counterparts are rarely discussed. In this paper, we propose a multi-level interaction paradigm for training lightweight CLIP models. Firstly, to mitigate the problem that some image-text pairs are not in strict one-to-one correspondence, we improve the conventional global instance-level alignment objective by progressively softening the labels of negative samples. Secondly, a token-level alignment objective based on relaxed bipartite matching is introduced for finer-grained alignment between image patches and textual words. Moreover, based on the observation that the accuracy of a CLIP model does not increase correspondingly as the parameters of the text encoder increase, an extra masked language modeling (MLM) objective is leveraged to maximize the potential of the shortened text encoder. In practice, an auxiliary fusion module that injects unmasked image embeddings into masked text embeddings at different network stages is proposed to enhance the MLM. Extensive experiments show that, without introducing additional computational cost during inference, the proposed method achieves higher performance on multiple downstream tasks.
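
To make the idea of progressively softened negative labels concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: it assumes uniform label smoothing whose strength grows linearly during training, and the names `softened_targets`, `soft_cross_entropy`, and `max_smooth` are hypothetical.

```python
import math


def softened_targets(batch_size, step, total_steps, max_smooth=0.2):
    """Return a batch_size x batch_size matrix of soft labels.

    Row i is the target distribution for image i over all texts in the batch.
    Assumption (not from the paper): the probability mass moved from the
    positive pair to the negatives grows linearly from 0 to max_smooth.
    Requires batch_size >= 2.
    """
    alpha = max_smooth * min(step / total_steps, 1.0)
    off_diagonal = alpha / (batch_size - 1)  # mass shared equally by negatives
    return [
        [1.0 - alpha if i == j else off_diagonal for j in range(batch_size)]
        for i in range(batch_size)
    ]


def soft_cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) rows and soft target rows."""
    total = 0.0
    for row, target in zip(logits, targets):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total -= sum(t * (x - log_z) for x, t in zip(row, target))
    return total / len(logits)


# Image-to-text similarity logits for a toy batch of three pairs.
logits = [[5.0, 1.0, 0.5], [0.8, 4.0, 1.2], [0.3, 0.9, 6.0]]
early = soft_cross_entropy(logits, softened_targets(3, 0, 100))
late = soft_cross_entropy(logits, softened_targets(3, 100, 100))
```

At step 0 the targets are the usual one-hot labels; by the end of the schedule each negative carries a small share of the probability mass, so a caption that partly matches another image in the batch is penalized less harshly.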

What problem does this paper attempt to address?

This paper addresses the problem of deploying lightweight Vision-Language Models (VLMs), especially CLIP-like models, on resource-constrained devices. Although existing CLIP-like models perform well on various downstream tasks, such as zero-shot image classification and image-text retrieval, they usually adopt large image encoders (such as ResNet50 and ViT), which makes them difficult to deploy on edge devices. The paper therefore proposes a multi-level interaction paradigm for training lightweight CLIP models, aiming to improve performance on multiple downstream tasks without increasing the computational cost during inference. Specifically, the paper addresses the following problems:

1. **Image-text pairs without strict one-to-one correspondence**: Some image-text pairs crawled from the web are not in strict one-to-one correspondence, which is particularly harmful when training lightweight CLIP models. To address this, the paper improves the conventional global instance-level alignment objective by progressively softening the labels of negative samples.
2. **Fine-grained alignment**: The global instance-level alignment objective alone is not sufficient to align image patches with text words. The paper introduces a token-level alignment objective based on relaxed bipartite matching to achieve this finer-grained alignment.
3. **More text encoder parameters do not yield a corresponding accuracy gain**: The accuracy of CLIP models does not increase correspondingly as the number of text encoder parameters increases. The paper therefore reduces the number of text encoder layers and introduces a Masked Language Modeling (MLM) objective to maximize the potential of the shortened text encoder.

With these methods, the proposed approach achieves higher performance on multiple benchmarks while adding no computational cost at inference time.
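
The token-level alignment in item 2 can be illustrated with a small sketch. A strict bipartite matching would pair each patch with exactly one word; the relaxation assumed here (an illustrative choice, not confirmed as the paper's exact procedure) lets every token pick its best counterpart on the other side, so several patches may match the same word. The function names are hypothetical, and the embeddings are assumed to be non-zero vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relaxed_matching_similarity(patches, words):
    """Token-level image-text similarity under a relaxed matching.

    Each patch is matched to its most similar word and each word to its
    most similar patch; the two directional averages are then combined.
    Unlike a strict one-to-one assignment, many-to-one matches are allowed.
    """
    sim = [[cosine(p, w) for w in words] for p in patches]
    patch_to_word = sum(max(row) for row in sim) / len(patches)
    word_to_patch = sum(
        max(sim[i][j] for i in range(len(patches))) for j in range(len(words))
    ) / len(words)
    return 0.5 * (patch_to_word + word_to_patch)


# Three patch embeddings and two word embeddings (toy 2-D vectors).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
score = relaxed_matching_similarity(patches, words)
```

In this toy case two patches match the same word, which a strict one-to-one assignment would forbid. A score of this kind could replace the global embedding similarity inside a contrastive loss; only the encoders are needed at inference time, so the matching adds no deployment cost in this sketch.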

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models
