RWKV-CLIP: A Robust Vision-Language Representation Learner

Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
2024-09-23
Abstract: Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust and efficient vision-language representation learner. It achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/RWKV-CLIP
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to address the following issues:

1. **Data Noise Issue**: Existing Contrastive Language-Image Pre-training (CLIP) methods improve performance on various vision-language tasks by collecting large-scale image-text pairs from the internet, but these pairs contain a significant amount of noise, which degrades what the model learns.
2. **Model Architecture Improvement**: To overcome the computational complexity of Transformer models when processing high-resolution images and long sequences, the paper proposes a new model, RWKV-CLIP, that combines the parallel training advantages of Transformers with the efficient inference capabilities of Recurrent Neural Networks (RNNs).

Specifically, the paper presents two main contributions:

- **Diverse Description Generation Framework**: Uses large language models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags, producing more accurate and semantically rich descriptions.
- **RWKV-CLIP Model**: The first vision-language representation learning model based on the RWKV architecture, which combines the parallel training of Transformers with the efficient inference of RNNs and achieves state-of-the-art performance on downstream tasks including linear probing, zero-shot classification, and zero-shot image-text retrieval.
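
Both contributions build on CLIP-style contrastive pre-training, so it may help to see why noisy pairs hurt. The summary above does not spell out RWKV-CLIP's exact training objective; the sketch below shows the standard symmetric contrastive loss used by CLIP, written in plain Python for readability. The function names and the temperature value of 0.07 are illustrative assumptions, not taken from the paper or its code release.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    image_embs[i] and text_embs[i] are assumed to describe the same pair.
    temperature=0.07 is an assumed illustrative value.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    image_to_text = diagonal_cross_entropy(logits)
    text_to_image = diagonal_cross_entropy(logits_transposed)
    return 0.5 * (image_to_text + text_to_image)


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    clean_texts = [[1.0, 0.0], [0.0, 1.0]]   # each text matches its image
    noisy_texts = [[0.0, 1.0], [1.0, 0.0]]   # texts attached to the wrong images
    print("clean pairs:", clip_contrastive_loss(images, clean_texts))
    print("noisy pairs:", clip_contrastive_loss(images, noisy_texts))
```

In this toy setup the loss is near zero when every text matches its image and large when the texts are attached to the wrong images, which is the sense in which mismatched web-crawled pairs send a misleading training signal.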