Classification Done Right for Vision-Language Pre-Training

Huang Zilong, Ye Qinghao, Kang Bingyi, Feng Jiashi, Fan Haoqi
2024-11-06
Abstract: We introduce SuperClass, a super simple classification method for vision-language pre-training on image-text data. Unlike its contrastive counterpart CLIP, which contrasts images against the output of a text encoder, SuperClass directly uses tokenized raw text as supervised classification labels, without any additional text filtering or selection. Because no text encoding serves as a contrastive target, SuperClass requires no text encoder and does not need to maintain a large batch size as CLIP does. SuperClass demonstrates superior performance on various downstream tasks, including classic computer vision benchmarks and vision-language downstream tasks. We further explore the scaling behavior of SuperClass with respect to model size, training length, and data size, and report encouraging results and comparisons to CLIP. Project repository: https://github.com/x-cls/superclass
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is that existing contrastive learning methods for vision-language pre-training, such as CLIP, require very large batch sizes and substantial computing resources, which limits their use by researchers with limited resources. To solve this, the paper proposes a new pre-training method, SuperClass. It simplifies pre-training and reduces computing cost by directly using text tokens as supervised classification labels, without an additional text encoder or large-batch training, while showing performance comparable to or better than CLIP on various downstream tasks. Specifically, the main contributions of the paper include:
1. **Simplifying the pre-training framework**: By using the original text tokens as the supervision signal, SuperClass avoids the large-batch training and the text encoder that contrastive learning methods require, which greatly simplifies the pre-training framework and reduces computing cost.
2. **Preserving text information**: Unlike methods that rely on hand-written rules to pre-process text, SuperClass uses unprocessed text tokens directly and so preserves all the information in the text, which is valuable for representation learning.
3. **Excellent scalability**: Experiments show that SuperClass scales with model size, training length, and data volume comparably to or better than CLIP, and that it achieves excellent performance on multiple downstream tasks.
4. **High competitiveness**: On the same pre-training dataset, SuperClass significantly outperforms its contrastive-learning counterparts on image classification and vision-language tasks, demonstrating its potential as a competitive alternative.
In conclusion, this paper aims to provide a more efficient and simpler vision-language pre-training method to reduce computing costs and improve model performance, especially in the case of limited resources.
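To make the idea of "text tokens as classification labels" concrete, here is a minimal Python sketch, not the authors' implementation. Several things in it are assumptions made for illustration: a whitespace tokenizer stands in for a real subword tokenizer, the loss is a softmax cross-entropy against a normalized multi-hot target (the text above does not say which loss the paper uses), and the image encoder is omitted, so the logits are supplied by hand. All function names are hypothetical.

```python
import math
from typing import Dict, List


def build_vocab(captions: List[str]) -> Dict[str, int]:
    """Assign each token a class id. Whitespace splitting is an
    assumed stand-in for a real subword tokenizer."""
    vocab: Dict[str, int] = {}
    for caption in captions:
        for tok in caption.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab


def caption_to_target(caption: str, vocab: Dict[str, int]) -> List[float]:
    """Turn a caption into a multi-hot label vector over the vocabulary,
    normalized to sum to 1 (the normalization is an assumption)."""
    target = [0.0] * len(vocab)
    for tok in set(caption.lower().split()):
        target[vocab[tok]] = 1.0
    total = sum(target)
    return [t / total for t in target]


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def classification_loss(logits: List[float], target: List[float]) -> float:
    """Cross-entropy between the predicted token distribution and the
    caption's target. Only this one image's logits and caption are used."""
    logp = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, logp))


captions = ["a dog on the grass", "a cat on a sofa"]
vocab = build_vocab(captions)
target = caption_to_target(captions[0], vocab)

# Logits would normally come from the vision encoder's classification head;
# here an uninformative all-zero prediction is used as a placeholder.
uniform_logits = [0.0] * len(vocab)
print(classification_loss(uniform_logits, target))
```

In this sketch the loss for an image depends only on that image's logits and its own caption tokens. No other samples in the batch act as negatives and no text encoder is called, which is consistent with the claim above that the method does not need a large batch size.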