Abstract: Contrastive Language-Image Pretraining (CLIP) models are able to capture the semantic relationship of images and texts and have enabled a wide range of applications, from image retrieval to classification. These models are trained with datasets extracted from web crawls, which are of large quantity but limited quality. This paper explores whether limited amounts of higher-quality data in a specific domain improve the general performance of CLIP models. To this end, we extract text-image data from scientific papers hosted in the arXiv and PubMed Central repositories. Experiments on small-scale CLIP models (ViT B/32) show that model performance increases on average, but only moderately. This result indicates that using the data sources considered in the paper to train large-scale CLIP models is a worthwhile research direction.

What problem does this paper attempt to address?

The paper primarily explores how to improve the performance of the Contrastive Language-Image Pre-training (CLIP) model by leveraging high-quality image-text data from scientific papers. Specifically, the authors collected data from two sources: the arXiv repository (which covers a wide range of quantitative fields) and PubMed Central (which provides open-access papers in the biomedical field). The aim is to investigate whether high-quality data from these specific domains can enhance the general performance of the CLIP model.

The main contributions of the paper include:

1. **Data Collection**: The authors proposed a method to extract images and their corresponding textual descriptions (such as captions) from arXiv and PubMed Central, and used this data to train the CLIP model.
2. **Experimental Design**: To validate the effectiveness of this data, the authors combined it with an existing large-scale web-crawled dataset (CommonPool) and trained a small-scale CLIP model (based on the ViT B/32 architecture). They then evaluated the model's performance on multiple standard tasks to measure its generalization ability.
3. **Results Analysis**: The results showed that the overall performance of the model improved after incorporating data from arXiv and PubMed Central, although the improvement was moderate. Additionally, the performance gains were not uniformly distributed, with certain tasks (such as those involving biomedical images) showing more significant improvements.

In summary, this study aims to explore the impact of high-quality domain-specific data on the performance of the CLIP model and provides preliminary evidence of the feasibility of this approach, especially for certain specific tasks. However, the authors also pointed out some limitations of the study, such as the constraint on the amount of data and potential directions for future improvements.

Training CLIP models on Data from Scientific Papers
