Abstract: Contrastive Language-Image Pre-training (CLIP) has been the cornerstone of zero-shot classification, text-image retrieval, and text-to-image generation by aligning the image and text modalities. Despite its widespread adoption, a significant limitation of CLIP is the short length of its text input. The text input is restricted to 77 tokens, and an empirical study shows that the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applications in image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses CLIP's zero-shot generalizability, and stays aligned with the CLIP latent space, so that it can readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model that supports longer contexts necessitates pretraining on vast amounts of data, incurring significant expense. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain the original capabilities: (1) a knowledge-preserved stretching of the positional embedding and (2) a primary component matching of CLIP features. Leveraging just one million extra long text-image pairs, Long-CLIP outperforms CLIP by about 20% in long-caption text-image retrieval and by 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
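
To make the idea of a knowledge-preserved stretching of the positional embedding more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the first `keep` positions are left untouched (motivated by the reported effective length of under 20) and that the remaining positions are stretched by linear interpolation with a factor `ratio`. The function name, the values `keep=20` and `ratio=4`, and the choice of linear interpolation are all illustrative assumptions that the abstract does not specify.

```python
from typing import List

Vector = List[float]


def stretch_positional_embedding(
    pos_emb: List[Vector], keep: int = 20, ratio: int = 4
) -> List[Vector]:
    """Stretch a positional-embedding table (hypothetical helper).

    The first `keep` rows are copied unchanged; the remaining rows are
    linearly interpolated so that each original step is split into
    `ratio` finer steps. `keep` and `ratio` are assumed values.
    """
    head = [row[:] for row in pos_emb[:keep]]  # well-trained positions, kept as-is
    tail = pos_emb[keep:]                      # positions to be stretched
    if not tail:
        return head

    stretched: List[Vector] = []
    last = len(tail) - 1
    for j in range(len(tail) * ratio):
        src = j / ratio                # fractional index into the original tail
        lo = min(int(src), last)
        hi = min(lo + 1, last)         # clamp at the final row
        w = src - lo                   # interpolation weight toward `hi`
        stretched.append(
            [(1.0 - w) * a + w * b for a, b in zip(tail[lo], tail[hi])]
        )
    return head + stretched


# Toy 77-row table with 2-dimensional embeddings (illustrative data only).
toy_table = [[float(i), 2.0 * i] for i in range(77)]
long_table = stretch_positional_embedding(toy_table)
print(len(long_table))
```

Under these assumed settings, a 77-row table grows to 20 + 57 × 4 = 248 rows, while the first 20 rows stay identical to the original, which is the sense in which the stretching preserves existing knowledge.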

LabCLIP: Label-Enhanced Clip for Improving Zero-Shot Text Classification

CLIPText: A New Paradigm for Zero-shot Text Classification.

TagCLIP: Improving Discrimination Ability of Zero-Shot Semantic Segmentation

ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation

Online Zero-Shot Classification with CLIP

Transductive Zero-Shot and Few-Shot CLIP

Application of CLIP for Efficient Zero-Shot Learning

Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training

PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation

Language-Driven Cross-Modal Classifier for Zero-Shot Multi-Label Image Recognition

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

Long-CLIP: Unlocking the Long-Text Capability of CLIP

Building an Open-Vocabulary Video CLIP Model With Better Architectures, Optimization and Data

SimCLIP: Refining Image-Text Alignment with Simple Prompts for Zero-/Few-shot Anomaly Detection

WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation

Towards Alleviating Text-to-Image Retrieval Hallucination for CLIP in Zero-shot Learning

CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP