Abstract: We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts image features against the output of a text encoder, SuperClass directly uses tokenized raw text as supervised classification labels, without the need for additional text filtering or selection. Because there is no text encoding to serve as a contrastive target, SuperClass does not require a text encoder and does not need to maintain a large batch size as CLIP does. SuperClass demonstrated superior performance on various downstream tasks, including classic computer vision benchmarks and vision-language downstream tasks. We further explored the scaling behavior of SuperClass with respect to model size, training length, and data size, and reported encouraging results and comparisons to CLIP. Project repository: https://github.com/x-cls/superclass

What problem does this paper attempt to address?

The paper addresses a limitation of existing contrastive methods for vision-language pre-training, such as CLIP: they require very large batch sizes and substantial computing resources, which puts them out of reach for researchers with limited resources. To address this, the paper proposes a new pre-training method, SuperClass. It uses text tokens directly as supervised classification labels, so it needs neither an additional text encoder nor large-batch training. This simplifies pre-training and reduces computing cost, while achieving performance comparable to or better than CLIP on various downstream tasks. Specifically, the main contributions of the paper include:

1. **Simplifying the pre-training framework**: By using the original text tokens as the supervision signal, SuperClass avoids the large-batch training and the text encoder that contrastive methods require, which greatly simplifies the framework and reduces the consumption of computing resources.
2. **Preserving text information**: Unlike methods that rely on manual rules to pre-process text, SuperClass uses unprocessed text tokens directly and so preserves all the information in the text, which is valuable for representation learning.
3. **Excellent scalability**: Experiments show that the scaling behavior of SuperClass with respect to model size, training length, and data volume is comparable to or better than that of CLIP, and that it performs well on multiple downstream tasks.
4. **High competitiveness**: On the same pre-training dataset, SuperClass significantly outperforms its contrastive-learning counterparts on image classification and vision-language tasks, demonstrating its potential as a competitive alternative.

In conclusion, this paper aims to provide a simpler and more efficient vision-language pre-training method that reduces computing cost and improves model performance, especially when resources are limited.
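
To make the idea of "text tokens as classification labels" concrete, here is a minimal sketch in plain Python. The summary above says only that tokenized raw text serves as the supervised label; the specific loss used here (softmax cross-entropy against a normalized multi-hot target over the token vocabulary) is an assumption for illustration, and the function names `multi_hot_target`, `log_softmax`, and `classification_loss` are hypothetical. The logits stand in for the output of an image encoder's classification head, and the token ids stand in for the output of some subword tokenizer; neither component is modeled.

```python
import math


def multi_hot_target(token_ids, vocab_size):
    """Turn a caption's token ids into a normalized multi-hot label vector.

    Every distinct token in the caption becomes a positive class, and the
    positives share a total probability mass of 1 (an assumed normalization).
    """
    if not token_ids:
        raise ValueError("caption must contain at least one token")
    target = [0.0] * vocab_size
    for t in set(token_ids):
        target[t] = 1.0
    total = sum(target)
    return [v / total for v in target]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def classification_loss(logits, token_ids):
    """Cross-entropy between the image logits and the caption's token labels.

    `logits` has one entry per vocabulary token. The loss for one image-text
    pair depends only on that pair, not on other samples in the batch.
    """
    target = multi_hot_target(token_ids, len(logits))
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_probs))


if __name__ == "__main__":
    # Toy vocabulary of 4 tokens; the caption tokenizes to ids [1, 3].
    caption_ids = [1, 3]
    print(classification_loss([0.0, 0.0, 0.0, 0.0], caption_ids))
    print(classification_loss([0.0, 5.0, 0.0, 5.0], caption_ids))
```

In this sketch the loss is computed per image-text pair, with no in-batch negatives and no text encoder. This matches the summary's point that SuperClass does not depend on a large batch size the way a contrastive objective does.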

Classification Done Right for Vision-Language Pre-Training

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Non-Contrastive Learning Meets Language-Image Pre-Training

SuS-X: Training-Free Name-Only Transfer of Vision-Language Models

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

Improving CLIP Training with Language Rewrites

S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions

Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model

What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

AutoCLIP: Auto-tuning Zero-Shot Classifiers for Vision-Language Models

RankCLIP: Ranking-Consistent Language-Image Pretraining

PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts

Transductive CLIP with Class-Conditional Contrastive Learning

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

ChatGPT-Powered Hierarchical Comparisons for Image Classification