Abstract: We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0% mAP absolute gain in performance on Pascal VOC classification, and a +22.1% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite

What problem does this paper attempt to address?

The paper addresses the cost of contrastive vision-language pre-training in models such as CLIP. When optimizing its contrastive learning objective, CLIP needs many negative pairs (non-matching image-text pairs) for each positive pair (a matching image-text pair). This raises the amount of training data required and forces either large batch sizes or an additional store of negative samples, which increases memory and compute overhead and limits the model's effectiveness on smaller datasets.

To overcome these limitations, the paper proposes CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Its main contributions are:

1. **Reducing the need for negative samples**: CLIP-Lite requires only one negative pair for each positive pair. It achieves this by maximizing an information-efficient lower bound on the mutual information between the image and text modalities, which lowers data demand and memory overhead during training.
2. **Improving performance**: Despite using far fewer negative samples, CLIP-Lite outperforms CLIP at the same scale. After pretraining on COCO-Captions, the abstract reports absolute gains of +14.0% mAP on Pascal VOC classification and +22.1% top-1 accuracy on ImageNet, with results comparable or superior to other, more complex, text-supervised models.
3. **Data efficiency**: CLIP-Lite can be trained with significantly less data and smaller batch sizes while still performing better than CLIP, which is its main advantage on small datasets.
4. **Downstream task performance**: CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding, which supports the generality of its pre-trained representations.
5. **Bias-free visual representations**: CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks.

Through these improvements, CLIP-Lite aims to provide a more efficient and flexible vision-language pre-training method, particularly suited to scenarios with limited data.
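The difference between the two objectives can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation. The text above does not name the lower bound that CLIP-Lite uses, so the sketch assumes a Jensen-Shannon-style mutual information estimator, which needs only one mismatched pair per matched pair. The function names, the dot-product critic, and the one-directional InfoNCE baseline are all assumptions made for illustration.

```python
import numpy as np


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return np.logaddexp(0.0, z)


def jsd_one_negative_loss(img, txt, rng):
    """Assumed JSD-style MI lower bound with ONE negative caption per image.

    img, txt: (N, D) arrays where row i of each forms a matching pair.
    Requires N >= 2 so that a mismatched caption exists.
    """
    n = img.shape[0]
    # Shift captions by a random non-zero offset so every image gets
    # exactly one caption that belongs to a different image.
    offset = int(rng.integers(1, n))
    neg_txt = np.roll(txt, offset, axis=0)

    # Hypothetical critic: a plain dot product between the two features.
    pos_scores = np.sum(img * txt, axis=1)      # N scores
    neg_scores = np.sum(img * neg_txt, axis=1)  # N scores

    # Bound = E_pos[-softplus(-T)] - E_neg[softplus(T)]; minimize its negative.
    bound = np.mean(-softplus(-pos_scores)) - np.mean(softplus(neg_scores))
    return -bound


def info_nce_loss(img, txt, temperature=1.0):
    """CLIP-style InfoNCE (image-to-text direction only, for brevity).

    Every other caption in the batch acts as a negative: N * N scores.
    """
    logits = img @ txt.T / temperature
    log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    return -np.mean(np.diag(log_probs))
```

With a batch of N pairs, the one-negative loss evaluates 2N critic scores, while the InfoNCE baseline evaluates N × N, which is why the latter's quality depends on batch size.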

CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Improving CLIP Training with Language Rewrites

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions

DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance

HyperCLIP: Adapting Vision-Language models with Hypernetworks

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

CLIPPO: Image-and-Language Understanding from Pixels Only

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

How Much Can CLIP Benefit Vision-and-Language Tasks?

Diffusion Feedback Helps CLIP See Better