Abstract: Contrastive vision-language models (e.g. CLIP) are typically created by updating all the parameters of a vision model and a language model through contrastive training. Can such models be created with a small number of parameter updates to an already-trained language model and vision model? The literature describes techniques that can create vision-language models by updating a small number of parameters in a language model, but these require already-aligned visual representations and are non-contrastive, hence unusable for latency-sensitive applications such as neural search. We explore the feasibility and benefits of parameter-efficient contrastive vision-language alignment through transfer learning: creating a model such as CLIP by minimally updating an already-trained vision model and language model. We find that a minimal set of parameter updates (<7%) can achieve the same performance as full-model training, and that updating specific components (<1% of parameters) can reach 75% of full-model training performance. We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training and that parameter-efficient training scales with model and dataset size. Where paired image-text data is scarce but strong multilingual language models exist (e.g. low-resource languages), parameter-efficient training is even preferable to full-model training. Given a fixed compute budget, parameter-efficient training allows training larger models on the same hardware, achieving equivalent performance in less time. Parameter-efficient training hence constitutes an energy-efficient and effective training strategy for contrastive vision-language models that may be preferable to the full-model training paradigm for common use cases. Code and weights are available at https://github.com/codezakh/LilT.

XtremeCLIP: Extremely Parameter-efficient Tuning for Low-resource Vision Language Understanding

How Much Can CLIP Benefit Vision-and-Language Tasks?

CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment

CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

CLIPVQA: Video Quality Assessment via CLIP

Parameter-Efficient Cross-lingual Transfer of Vision and Language Models via Translation-based Alignment

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training

MLIP: Efficient Multi-Perspective Language-Image Pretraining with Exhaustive Data Utilization

Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification

UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

Enhancing Multimodal Understanding with CLIP-Based Image-to-Text Transformation

Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning