Abstract: Beyond the success of Contrastive Language-Image Pre-training (CLIP), recent trends mark a shift toward exploring the applicability of lightweight vision-language models for resource-constrained scenarios. These models often deliver suboptimal performance when relying solely on a single image-text contrastive learning objective, spotlighting the need for more effective training mechanisms that guarantee robust cross-modal feature alignment. In this work, we propose CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance, a simple and efficient training paradigm designed to boost the performance of lightweight vision-language models with minimal computational overhead and lower data demands. CLIP-PING bootstraps unimodal features extracted from arbitrary pre-trained encoders to obtain intrinsic guidance of proximus neighbor samples, i.e., nearest-neighbor (NN) and cross nearest-neighbor (XNN). We find that extra contrastive supervision from these neighbors substantially boosts cross-modal alignment, enabling lightweight models to learn more generic features with rich semantic diversity. Extensive experiments reveal that CLIP-PING notably surpasses its peers in zero-shot generalization and cross-modal retrieval tasks. Specifically, it achieves a 5.5% gain on zero-shot ImageNet1K and gains of 10.7% (I2T) and 5.7% (T2I) on Flickr30K over the original CLIP, when using a ViT-XS image encoder trained on 3 million (image, text) pairs. Moreover, CLIP-PING showcases strong transferability under the linear evaluation protocol across several downstream tasks.

What problem does this paper attempt to address?

This paper addresses the poor performance of lightweight vision-language models in resource-constrained scenarios. Specifically:

1. **Computational and data requirements**:
   - Multi-modal contrastive learning methods such as CLIP (Contrastive Language-Image Pre-training) have achieved remarkable success on large-scale datasets, but their computational burden and their need for large amounts of data make them difficult to apply in resource-constrained environments.
   - Large pre-trained models usually require substantial compute and storage, which limits their use in low-resource settings such as mobile devices and edge computing.
2. **Limitations of a single contrastive learning objective**:
   - Lightweight models trained with only a single image-text contrastive objective often perform suboptimally. More effective training mechanisms are needed to ensure robust cross-modal feature alignment.
3. **Limitations of knowledge distillation methods**:
   - Existing knowledge distillation methods can improve lightweight models to some extent, but they usually require complex architectural constraints and high computational costs, especially when dealing with high-dimensional multi-modal representations.

### The method proposed in the paper

To address these problems, the paper proposes CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance. CLIP-PING improves the performance of lightweight vision-language models in the following ways:

- **Utilizing features extracted by pre-trained encoders**: Unimodal features are extracted from arbitrary pre-trained encoders and kept frozen in an auxiliary feature bank, which provides intrinsic guidance from neighboring samples.
- **Introducing neighbor-based contrastive supervision**: In addition to the standard image-text contrastive objective, CLIP-PING adds extra contrastive supervision from neighboring samples, namely the nearest neighbor (NN) and the cross nearest neighbor (XNN), to strengthen cross-modal alignment.
- **Reducing computational overhead and data requirements**: By reusing the features of pre-trained encoders, CLIP-PING improves model performance with minimal additional computational overhead.

### Experimental results

The experimental results show that CLIP-PING outperforms competing methods in zero-shot classification and cross-modal retrieval. For example, with a ViT-XS image encoder trained on 3 million (image, text) pairs, CLIP-PING improves zero-shot classification accuracy on ImageNet1K by 5.5% over the original CLIP, and improves R@1 on Flickr30K by 10.7% for image-to-text (I2T) retrieval and by 5.7% for text-to-image (T2I) retrieval.

In summary, by introducing proximus intrinsic neighbors guidance, CLIP-PING improves the performance of lightweight vision-language models in resource-constrained scenarios while keeping computational overhead and data requirements low.
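The neighbor guidance described in the method section can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how NN and XNN enter the loss, so this code assumes that the NN of a sample is the most similar *other* entry in a frozen image feature bank, that the XNN is the text feature stored at that same index, and that bank features share the student's embedding dimension. The function names, the `weight` parameter, and the image-side-only guidance are all hypothetical choices made for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nearest_index(query, bank, exclude=None):
    """Index of the bank row most cosine-similar to `query`, skipping `exclude`."""
    q = normalize(query)
    candidates = [i for i in range(len(bank)) if i != exclude]
    return max(candidates, key=lambda i: dot(q, normalize(bank[i])))

def info_nce(anchors, targets, temperature=0.07):
    """Mean cross-entropy in which anchors[i] should match targets[i]."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [dot(a, t) / temperature for t in targets]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)

def clip_loss(img, txt, temperature=0.07):
    """Symmetric image-text contrastive loss over one batch."""
    return 0.5 * (info_nce(img, txt, temperature) + info_nce(txt, img, temperature))

def neighbor_guided_loss(img, txt, ids, img_bank, txt_bank, weight=1.0):
    """CLIP loss plus extra contrastive terms against NN and XNN targets.

    ASSUMPTION: `ids[k]` is the bank index of batch sample k, and `weight`
    is a hypothetical knob balancing the extra terms.
    """
    img = [normalize(v) for v in img]
    txt = [normalize(v) for v in txt]
    # NN: closest other sample in the frozen image bank.
    nn_ids = [nearest_index(img_bank[i], img_bank, exclude=i) for i in ids]
    nn = [normalize(img_bank[j]) for j in nn_ids]
    # XNN (assumed): the text feature paired with that nearest image.
    xnn = [normalize(txt_bank[j]) for j in nn_ids]
    extra = info_nce(img, nn) + info_nce(img, xnn)
    return clip_loss(img, txt) + weight * extra
```

Because the banks are frozen, the neighbor lookup needs no gradient computation and could be precomputed once per dataset, which is consistent with the low-overhead claim above. A real implementation would use batched tensor operations and would likely apply the same guidance on the text side as well.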

CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance
