Training CLIP models on Data from Scientific Papers

Calvin Metzger
2023-11-08
Abstract: Contrastive Language-Image Pretraining (CLIP) models are able to capture the semantic relationship between images and texts and have enabled a wide range of applications, from image retrieval to classification. These models are trained on datasets extracted from web crawls, which are large in quantity but limited in quality. This paper explores whether limited amounts of higher-quality data in a specific domain improve the general performance of CLIP models. To this end, we extract text-image data from scientific papers hosted in the arXiv and PubMed Central repositories. Experiments on small-scale CLIP models (ViT B/32) show that model performance increases on average, but only moderately. This result indicates that using the data sources considered in the paper to train large-scale CLIP models is a worthwhile research direction.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper primarily explores how to improve the performance of the Contrastive Language-Image Pre-training (CLIP) model by leveraging high-quality image-text data from scientific papers. Specifically, the authors collected data from two sources: the arXiv repository (which covers a wide range of quantitative fields) and PubMed Central (which provides open-access papers in the biomedical field). The aim is to investigate whether high-quality data from these specific domains can enhance the general performance of the CLIP model.

The main contributions of the paper include:

1. **Data Collection**: The authors proposed a method to extract images and their corresponding textual descriptions (such as captions) from arXiv and PubMed Central, and used this data to train the CLIP model.
2. **Experimental Design**: To validate the effectiveness of this data, the authors combined it with an existing large-scale web-crawled dataset (CommonPool) and trained a small-scale CLIP model (based on the ViT B/32 architecture). They then evaluated the model's performance on multiple standard tasks to measure its generalization ability.
3. **Results Analysis**: The results showed that the overall performance of the model improved after incorporating data from arXiv and PubMed Central, although the improvement was moderate. Additionally, the performance gains were not uniformly distributed, with certain tasks (such as those involving biomedical images) showing more significant improvements.

In summary, this study aims to explore the impact of high-quality domain-specific data on the performance of the CLIP model and provides preliminary evidence of the feasibility of this approach, especially for certain specific tasks. However, the authors also pointed out some limitations of the study, such as the constraint on the amount of data and potential directions for future improvements.
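
For readers unfamiliar with the contrastive objective that CLIP-style training relies on, the following is a minimal sketch of the symmetric image-text contrastive loss in plain Python. It is an illustration only, not the authors' training code: the function names are hypothetical, the embeddings are toy vectors rather than ViT B/32 outputs, and the temperature of 0.07 is an assumed illustrative value (in practice it is typically a learned parameter).

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched image/text embeddings.

    image_embs[i] and text_embs[i] are assumed to come from the same
    image-caption pair; every other combination in the batch is a negative.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Transpose to obtain the text-to-image direction
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch: two orthogonal embeddings, paired correctly and then swapped
aligned = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = clip_contrastive_loss(aligned, aligned)
loss_swapped = clip_contrastive_loss(aligned, aligned[::-1])
```

In this toy batch the correctly paired embeddings give a loss close to zero, while swapping the captions gives a much larger loss. That gap is the signal that pulls matching figure-caption pairs together during training, regardless of whether the pairs come from web crawls or scientific papers.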