Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever
DOI: https://doi.org/10.48550/arXiv.2103.00020
2021-02-27
Abstract: State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on. We release our code and pre-trained model weights at https://github.com/OpenAI/CLIP.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses a limitation of existing computer vision systems: they are trained to predict a fixed set of predefined object categories. This form of supervision restricts their generality and usability, because additional labeled data is required to specify any other visual concept. As an alternative, the paper learns directly from raw text about images, using the large supply of (image, text) pairs on the internet for large-scale pre-training. After pre-training, natural language is used to reference learned visual concepts or describe new ones, which enables zero-shot transfer without task-specific training. The main contributions of the paper are:

1. **A new pre-training method**: The model is trained from scratch to predict which caption goes with which image, on a dataset of 400 million (image, text) pairs, and learns state-of-the-art image representations.
2. **Zero-shot transfer**: The pre-trained model references learned visual concepts, or describes new ones, through natural language, so it can be applied to downstream tasks without further training.
3. **Extensive performance evaluation**: The approach is benchmarked on over 30 existing computer vision datasets, covering OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without any dataset-specific training. For instance, it matches the accuracy of the original ResNet-50 on ImageNet zero-shot, without using any of the 1.28 million training examples.
4. **An efficient training objective**: The paper reports that swapping a predictive objective for a contrastive one makes zero-shot transfer to ImageNet about 4 times more efficient than a baseline that predicts a bag-of-words encoding of the caption. That baseline in turn learns about 3 times faster than a Transformer-based language model that predicts the exact caption text.

In summary, this paper aims to overcome the limitations of fixed-category supervision by using natural language supervision to train more general and flexible visual models.
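The two ideas summarized above, matching captions to images during pre-training and classifying by comparing an image against text descriptions of each class, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: the embeddings are hand-written toy vectors standing in for encoder outputs, the function names (`contrastive_loss`, `zero_shot_classify`) are hypothetical, the symmetric cross-entropy form is one common way to express a caption-matching objective, and the temperature value of 0.07 is only an illustrative default.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Cosine similarity of every image with every text, divided by the temperature."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]


def cross_entropy(logits_row, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits_row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return log_sum - logits_row[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss over a batch where the i-th image pairs with the i-th text.

    Assumption: averaging the image-to-text and text-to-image cross-entropies.
    """
    logits = similarity_logits(image_embs, text_embs, temperature)
    n = len(logits)
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    columns = [list(col) for col in zip(*logits)]  # transpose: texts as rows
    loss_t2i = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


def zero_shot_classify(image_emb, class_text_embs, temperature=0.07):
    """Return (index of best class, softmax probabilities) for one image.

    class_text_embs holds one text embedding per candidate class description.
    """
    logits = similarity_logits([image_emb], class_text_embs, temperature)[0]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


if __name__ == "__main__":
    # Toy 2-D embeddings (hypothetical); real encoders would produce these.
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched_texts = [[1.0, 0.1], [0.1, 1.0]]
    swapped_texts = [[0.1, 1.0], [1.0, 0.1]]
    print("loss, matched pairs:", contrastive_loss(images, matched_texts))
    print("loss, swapped pairs:", contrastive_loss(images, swapped_texts))
    print("zero-shot:", zero_shot_classify([0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]]))
```

In this sketch the loss is low when each image is most similar to its own caption and high when the captions are swapped, which is the signal that drives pre-training. Zero-shot classification reuses the same similarity computation: new classes are added by embedding new text descriptions, with no extra labeled images.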