Abstract: State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.

What problem does this paper attempt to address?

The paper addresses a limitation of existing computer vision systems: they are trained to predict a fixed set of predefined object categories. This restricted form of supervision limits their generality and usability, because additional annotated data is required to specify any other visual concept. The paper instead proposes learning directly from raw text, using the large number of (image, text) pairs available on the internet for large-scale pre-training. After pre-training, natural language can reference learned visual concepts or describe new ones, which enables zero-shot transfer without task-specific training.

The main contributions of the paper are:

1. **Proposing a new pre-training method**: By predicting which texts pair with which images, the model is pre-trained on a dataset of 400 million (image, text) pairs and learns state-of-the-art image representations.
2. **Achieving zero-shot transfer learning**: The pre-trained model can reference learned visual concepts or describe new ones through natural language, which allows zero-shot transfer to downstream tasks.
3. **Extensive performance evaluation**: The model is benchmarked on over 30 existing computer vision datasets, covering OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. It transfers non-trivially to most tasks and is often competitive with a fully supervised baseline.
4. **Efficient learning method**: The paper reports that swapping a predictive objective for CLIP's contrastive objective makes zero-shot transfer to ImageNet about 4 times more efficient than a baseline that predicts a bag-of-words encoding of the caption, and that this baseline in turn learns about 3 times faster than a Transformer-based language model that predicts the caption text.

In summary, the paper aims to overcome the limitations of fixed-category computer vision systems by using natural language supervision, yielding more general and flexible visual models.
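The two ideas summarized above, contrastive pre-training on matched (image, text) pairs and zero-shot prediction by comparing an image with text descriptions of each class, can be sketched in a few lines of NumPy. This is only an illustrative sketch and not the released CLIP code: the function names are hypothetical, the embeddings are toy arrays standing in for the outputs of real image and text encoders, and the temperature is fixed at 0.07 for simplicity rather than learned.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N image-text similarity matrix.

    Assumption: row i of image_emb and row i of text_emb form a matching pair.
    """
    logits = l2_normalize(image_emb) @ l2_normalize(text_emb).T / temperature
    diag = np.arange(logits.shape[0])
    # For each image, the correct "class" is its own caption (row-wise softmax).
    loss_image_to_text = -log_softmax(logits, axis=1)[diag, diag].mean()
    # For each caption, the correct "class" is its own image (column-wise softmax).
    loss_text_to_image = -log_softmax(logits, axis=0)[diag, diag].mean()
    return (loss_image_to_text + loss_text_to_image) / 2


def zero_shot_predict(image_emb, class_text_emb):
    """For each image, return the index of the most similar class text embedding."""
    sims = l2_normalize(image_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)


# Toy embeddings (hypothetical): four perfectly aligned pairs, then the same
# images in reversed order so that no image sits next to its own caption.
text = np.eye(4)
aligned_loss = contrastive_loss(np.eye(4), text)
mismatched_loss = contrastive_loss(np.eye(4)[::-1], text)

# Zero-shot: three class "prompts" and two images embedded in the same space.
class_text = np.eye(3)
images = np.array([[0.1, 0.9, 0.0], [2.0, 0.1, 0.3]])
predictions = zero_shot_predict(images, class_text)  # array([1, 0])
```

In this toy setup the loss is close to zero when pairs are aligned and large when they are mismatched, which is the signal that pushes the two encoders toward a shared embedding space. Zero-shot prediction then needs no additional training: a new class is added by embedding one more text description.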

Learning Transferable Visual Models From Natural Language Supervision

Learning Transferable Visual Models From Natural Language Supervision, Feb. 2021

Learning transferable visual models from natural language supervision. arXiv 2021

K-LITE: Learning Transferable Visual Models with External Knowledge

Knowledge Transfer Across Modalities with Natural Language Supervision

Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching

Learning Text-to-Video Retrieval from Image Captioning

S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions

Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Data-Efficient Language-Supervised Zero-Shot Learning with Self-Distillation

Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective

Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP

SuS-X: Training-Free Name-Only Transfer of Vision-Language Models

Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data

CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

Vision Learners Meet Web Image-Text Pairs

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision