CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

Chu Myaet Thwal, Ye Lin Tun, Minh N. H. Nguyen, Eui-Nam Huh, Choong Seon Hong
2024-12-05
Abstract: Beyond the success of Contrastive Language-Image Pre-training (CLIP), recent trends mark a shift toward exploring the applicability of lightweight vision-language models for resource-constrained scenarios. These models often deliver suboptimal performance when relying solely on a single image-text contrastive learning objective, spotlighting the need for more effective training mechanisms that guarantee robust cross-modal feature alignment. In this work, we propose CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance, a simple and efficient training paradigm designed to boost the performance of lightweight vision-language models with minimal computational overhead and lower data demands. CLIP-PING bootstraps unimodal features extracted from arbitrary pre-trained encoders to obtain intrinsic guidance of proximus neighbor samples, i.e., nearest-neighbor (NN) and cross nearest-neighbor (XNN). We find that extra contrastive supervision from these neighbors substantially boosts cross-modal alignment, enabling lightweight models to learn more generic features with rich semantic diversity. Extensive experiments reveal that CLIP-PING notably surpasses its peers in zero-shot generalization and cross-modal retrieval tasks. Specifically, it achieves a 5.5% gain on zero-shot ImageNet1K, along with gains of 10.7% (I2T) and 5.7% (T2I) on Flickr30K, compared to the original CLIP when using a ViT-XS image encoder trained on 3 million (image, text) pairs. Moreover, CLIP-PING showcases strong transferability under the linear evaluation protocol across several downstream tasks.
Computer Vision and Pattern Recognition
### What problem does this paper attempt to solve?

This paper addresses the poor performance of lightweight vision-language models in resource-constrained scenarios. Specifically:

1. **Challenges in computational resources and data requirements**:
   - Existing multimodal contrastive learning methods such as CLIP (Contrastive Language-Image Pre-training) have achieved remarkable success on large-scale datasets, but their computational burden and data demands make them difficult to apply in resource-constrained environments.
   - Large-scale pre-trained models usually require substantial compute and storage, which limits their use in low-resource settings such as mobile devices and edge computing.
2. **Limitations of a single contrastive learning objective**:
   - Lightweight models trained with only a single image-text contrastive learning objective often fall short of optimal performance. This indicates that more effective training mechanisms are needed to ensure robust cross-modal feature alignment.
3. **Limitations of knowledge distillation methods**:
   - Existing knowledge distillation methods can improve lightweight models to some extent, but they usually impose complex architectural constraints and high computational costs, especially when handling high-dimensional multimodal representations.

### The method proposed in the paper

To solve the above problems, the paper proposes CLIP-PING: Contrastive Language-Image Pre-training with Proximus Intrinsic Neighbors Guidance.
CLIP-PING improves the performance of lightweight vision-language models in the following ways:

- **Utilizing features extracted by pre-trained encoders**: Unimodal features are extracted from arbitrary pre-trained encoders and kept frozen in auxiliary feature banks, which provide intrinsic guidance from neighboring samples.
- **Introducing neighbor-based contrastive supervision**: In addition to the standard image-text contrastive learning objective, CLIP-PING adds extra supervision from neighboring samples (nearest-neighbor, NN, and cross nearest-neighbor, XNN), thereby enhancing cross-modal alignment.
- **Reducing computational overhead and data requirements**: By reusing the features of pre-trained encoders, CLIP-PING improves model performance with minimal additional computational overhead and lower data demands.

### Experimental results

The experimental results show that CLIP-PING notably outperforms its peers in zero-shot classification and cross-modal retrieval tasks. For example, with a ViT-XS image encoder trained on 3 million (image, text) pairs, CLIP-PING improves zero-shot classification accuracy on ImageNet1K by 5.5% over the original CLIP, and improves the Flickr30K retrieval R@1 metric by 10.7% (image-to-text) and 5.7% (text-to-image).

In summary, by introducing proximus intrinsic neighbors guidance, CLIP-PING effectively improves the performance of lightweight vision-language models in resource-constrained scenarios while keeping computational overhead and data requirements low.
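
To make the neighbor-guided objective more concrete, here is a minimal pure-Python sketch of how extra NN and XNN terms could sit alongside the standard image-text contrastive loss. It is an illustration under assumptions, not the authors' formulation: the summary above does not give the loss equations, so the pairing rules for NN and XNN, the weight `lam`, the temperature value, the function names, and the use of in-batch neighbors (rather than a separate feature bank) are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_matrix(a, b):
    """Pairwise cosine similarity between two lists of vectors."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def nearest_neighbors(bank):
    """For each row of a frozen feature bank, index of the most similar other row."""
    sims = cosine_matrix(bank, bank)
    n = len(bank)
    return [
        max((j for j in range(n) if j != i), key=lambda j: sims[i][j])
        for i in range(n)
    ]


def info_nce(queries, keys, targets, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[targets[i]]."""
    sims = cosine_matrix(queries, keys)
    total = 0.0
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[targets[i]]
    return total / len(sims)


def clip_ping_style_loss(student_img, student_txt, frozen_img, frozen_txt,
                         lam=0.5, temperature=0.07):
    """CLIP loss plus hypothetical NN/XNN terms; row i of every input is pair i."""
    identity = list(range(len(student_img)))
    # Neighbors come from the frozen teacher features, never from the student.
    nn_img = nearest_neighbors(frozen_img)
    nn_txt = nearest_neighbors(frozen_txt)

    def symmetric(img_targets, txt_targets):
        return 0.5 * (
            info_nce(student_img, student_txt, img_targets, temperature)
            + info_nce(student_txt, student_img, txt_targets, temperature)
        )

    clip = symmetric(identity, identity)  # standard image-text objective
    # Assumed NN term: image i is pulled toward the neighbor of its own caption,
    # and text i toward the neighbor of its own image.
    nn = symmetric(nn_txt, nn_img)
    # Assumed XNN term: image i is pulled toward the caption paired with its
    # neighboring image, and text i toward the image paired with its neighboring caption.
    xnn = symmetric(nn_img, nn_txt)
    return {"clip": clip, "nn": nn, "xnn": xnn, "total": clip + lam * (nn + xnn)}


# Toy batch of four (image, text) pairs with 3-dimensional features.
frozen_img = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.9, 0.1]]
frozen_txt = [[0.0, 0.0, 1.0], [0.1, 0.0, 0.9], [1.0, 1.0, 0.0], [0.9, 1.0, 0.0]]
student_img = [[0.8, 0.2, 0.1], [0.7, 0.3, 0.0], [0.1, 0.9, 0.2], [0.2, 0.8, 0.3]]
student_txt = [[0.9, 0.1, 0.0], [0.6, 0.2, 0.1], [0.0, 1.0, 0.1], [0.3, 0.7, 0.2]]

print(clip_ping_style_loss(student_img, student_txt, frozen_img, frozen_txt))
```

Because the frozen features can be computed once and stored, the only recurring cost in this sketch is the neighbor lookup and two additional cross-entropy terms over similarity matrices the student already produces, which is consistent with the low-overhead claim above. A practical implementation would more likely use a tensor library and search neighbors in a larger stored feature set.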