CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

Aman Shrivastava, Ramprasaath R. Selvaraju, Nikhil Naik, Vicente Ordonez
DOI: https://doi.org/10.48550/arXiv.2112.07133
2023-05-11
Abstract: We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0% mAP absolute gain in performance on Pascal VOC classification, and a +22.1% top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses a cost of existing vision-language pre-training models such as CLIP: their contrastive objective needs many negative pairs (non-matching image-text pairs) for each positive pair (a matching image-text pair). Supplying those negatives requires more training data and either large batch sizes or an additional store of negative samples, which raises memory and compute costs and limits how well the model trains on smaller datasets. To overcome these limitations, the paper proposes CLIP-Lite, a more information-efficient visual representation learning method that aligns visual features with textual annotations. The main contributions of CLIP-Lite include:

1. **Reducing the need for negative samples**: CLIP-Lite requires only one negative pair for each positive pair, which lowers the data demand and memory overhead of training.
2. **Improving performance**: Despite using far fewer negatives, CLIP-Lite matches or outperforms CLIP trained at the same scale on multiple benchmarks.
3. **Data efficiency**: CLIP-Lite can be pre-trained with less data and smaller batch sizes than CLIP while still performing better, which is its main advantage on small datasets.
4. **Downstream task performance**: After pretraining on COCO-Captions, CLIP-Lite transfers well to image classification, image and text retrieval, zero-shot classification, and visual grounding. The abstract reports absolute gains of +14.0% mAP on Pascal VOC classification and +22.1% top-1 accuracy on ImageNet.
5. **Bias-free visual representations**: CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks.

Through these improvements, CLIP-Lite aims to provide a more efficient and flexible vision-language pre-training method, particularly suitable for use in scenarios with limited data resources.
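The "one negative per positive" idea can be illustrated with a small sketch. The summary above does not name the specific lower bound, so the code below assumes a Jensen-Shannon-style mutual information estimator with a plain dot-product critic. The function names (`softplus`, `dot`, `one_negative_loss`) and the toy feature vectors are hypothetical and are not taken from the CLIP-Lite repository.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dot(u, v):
    # Assumed critic: a plain dot product between image and text features.
    return sum(a * b for a, b in zip(u, v))


def one_negative_loss(image_feats, text_feats, seed=0):
    """JSD-style loss using one mismatched caption per image.

    Minimizing this value maximizes a Jensen-Shannon-style lower bound
    on the mutual information between the two feature sets.
    """
    n = len(image_feats)
    assert n == len(text_feats) and n >= 2
    rng = random.Random(seed)
    total = 0.0
    for i in range(n):
        # Sample exactly one caption index j != i as the negative.
        j = rng.randrange(n - 1)
        if j >= i:
            j += 1
        pos = dot(image_feats[i], text_feats[i])  # matching pair score
        neg = dot(image_feats[i], text_feats[j])  # mismatched pair score
        # Push the positive score up and the single negative score down.
        total += softplus(-pos) + softplus(neg)
    return total / n


# Toy features: aligned captions should give a lower loss than swapped ones.
images = [[4.0, 0.0], [0.0, 4.0]]
aligned_texts = [[4.0, 0.0], [0.0, 4.0]]
swapped_texts = [[0.0, 4.0], [4.0, 0.0]]
aligned_loss = one_negative_loss(images, aligned_texts)
swapped_loss = one_negative_loss(images, swapped_texts)
```

Each image contributes two critic evaluations regardless of batch size, whereas an InfoNCE-style objective such as CLIP's scores every image against every caption in the batch. That difference is why an objective of this shape does not depend on large batches to supply negatives.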