RWKV-CLIP: A Robust Vision-Language Representation Learner

Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng

2024-09-23

Abstract: Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the dataset with image-text pairs obtained from websites. This paper further explores CLIP from the perspectives of data and model architecture. To address the prevalence of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a diverse description generation framework that can leverage Large Language Models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags. Furthermore, we propose RWKV-CLIP, the first RWKV-driven vision-language representation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Comprehensive experiments across various model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust and efficient vision-language representation learner. It achieves state-of-the-art performance in several downstream tasks, including linear probe, zero-shot classification, and zero-shot image-text retrieval. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/RWKV-CLIP.
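For readers unfamiliar with contrastive image-text pre-training, the following is a minimal sketch of the symmetric contrastive objective commonly associated with CLIP. It is an illustration only: the abstract does not spell out the exact loss used by RWKV-CLIP, and the function names and the `temperature=0.07` default are assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_z for x in row]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over a batch of paired embeddings.

    Pair i (image_embs[i], text_embs[i]) is the positive; every other
    combination in the batch acts as a negative. The temperature value
    is an assumption chosen for illustration.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each row should peak at its own index.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-image direction: each column should peak at its own index.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_i2t + loss_t2i)


# Correctly paired embeddings versus the same embeddings with texts swapped.
matched = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss for the correctly paired embeddings is close to zero, while swapping the texts makes it large, which is the signal that pulls matching image-text pairs together during pre-training.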

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses two issues:

1. **Data noise**: Existing Contrastive Language-Image Pre-training (CLIP) methods improve performance on vision-language tasks by collecting large-scale image-text pairs from the internet, but this data contains a significant amount of noise, which degrades what the model learns.
2. **Model architecture**: To reduce the computational cost that Transformer models incur on high-resolution images and long sequences, the paper proposes a new model, RWKV-CLIP, that combines the parallel training of Transformers with the efficient inference of Recurrent Neural Networks (RNNs).

It makes two main contributions:

- **Diverse description generation framework**: Uses large language models (LLMs) to synthesize and refine content from web-based texts, synthetic captions, and detection tags, producing more accurate and semantically rich descriptions.
- **RWKV-CLIP model**: The first vision-language representation learning model based on the RWKV architecture. It combines the parallel training of Transformers with the efficient inference of RNNs and achieves state-of-the-art results on several downstream tasks, including linear probing, zero-shot classification, and zero-shot image-text retrieval.
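The idea of pairing parallel training with recurrent inference can be illustrated with a toy decayed linear recurrence. This is a deliberately simplified sketch and not the RWKV operator itself, which is more elaborate; the scalar keys and values, the fixed `decay`, and both function names are assumptions made for this example. The point is only that the same outputs can be computed either position by position from the whole sequence (each position independent, so parallelizable) or with a single running state (constant memory per step).

```python
def parallel_form(keys, values, decay):
    """Compute every output directly from the full sequence.

    out[t] = sum over s <= t of decay**(t - s) * keys[s] * values[s].
    Each position is independent of the others, so all positions could be
    computed at the same time during training.
    """
    length = len(keys)
    return [
        sum(decay ** (t - s) * keys[s] * values[s] for s in range(t + 1))
        for t in range(length)
    ]


def recurrent_form(keys, values, decay):
    """Compute the same outputs with one running state, RNN style.

    Only the previous state is needed at each step, so inference cost per
    token does not grow with sequence length.
    """
    state = 0.0
    outputs = []
    for k, v in zip(keys, values):
        state = decay * state + k * v
        outputs.append(state)
    return outputs


# Hypothetical toy inputs used only to compare the two forms.
keys = [0.5, 1.0, -0.3, 2.0]
values = [1.0, 0.2, 0.7, -1.1]
decay = 0.9

parallel_out = parallel_form(keys, values, decay)
recurrent_out = recurrent_form(keys, values, decay)
max_gap = max(abs(a - b) for a, b in zip(parallel_out, recurrent_out))
```

Here `max_gap` is zero up to floating-point rounding, showing that the two forms agree on this toy recurrence.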

How Much Can CLIP Benefit Vision-and-Language Tasks?

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

SpaceCLIP: A Vision-Language Pretraining Framework With Spatial Reconstruction On Text

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

CLIPVQA: Video Quality Assessment via CLIP

VisualRWKV: Exploring Recurrent Neural Networks for Visual Language Models

LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

RLEG: Vision-Language Representation Learning with Diffusion-based Embedding Generation