Abstract: Contrastive Language-Image Pre-training (CLIP) has been the cornerstone for zero-shot classification, text-image retrieval, and text-to-image generation by aligning image and text modalities. Despite its widespread adoption, a significant limitation of CLIP lies in the inadequate length of its text input. The number of text tokens is restricted to 77, and an empirical study shows that the actual effective length is even less than 20. This prevents CLIP from handling detailed descriptions, limiting its applicability to image retrieval and text-to-image generation with extensive prerequisites. To this end, we propose Long-CLIP as a plug-and-play alternative to CLIP that supports long-text input, retains or even surpasses CLIP's zero-shot generalizability, and stays aligned with the CLIP latent space, so that it can readily replace CLIP without any further adaptation in downstream frameworks. Nevertheless, achieving this goal is far from straightforward, as simplistic fine-tuning can result in a significant degradation of CLIP's performance. Moreover, substituting the text encoder with a language model that supports longer contexts necessitates pretraining with vast amounts of data, incurring significant expense. Accordingly, Long-CLIP introduces an efficient fine-tuning solution on CLIP with two novel strategies designed to maintain the original capabilities: (1) a knowledge-preserved stretching of positional embedding and (2) a primary component matching of CLIP features. By leveraging just one million extra long text-image pairs, Long-CLIP outperforms CLIP by about 20% in long-caption text-image retrieval and by 6% in traditional text-image retrieval tasks, e.g., COCO and Flickr30k. Furthermore, Long-CLIP offers enhanced capabilities for generating images from detailed text descriptions by replacing CLIP in a plug-and-play manner.
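
To make the idea of a knowledge-preserved stretching of positional embedding more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the first `keep` positional embeddings (the well-trained ones, in line with the effective length of about 20 noted above) are left untouched, and that the remaining positions are stretched by linear interpolation with an integer factor `ratio`. The function name, the default values `keep=20` and `ratio=4`, the embedding width of 512, and the use of NumPy are all assumptions made for illustration; the abstract itself does not specify them.

```python
import numpy as np


def stretch_positional_embedding(pos_emb, keep=20, ratio=4):
    """Hypothetical helper: stretch a (length, dim) positional embedding table.

    The first `keep` rows are copied unchanged; the remaining rows are
    linearly interpolated so that each original step is split into `ratio` steps.
    """
    pos_emb = np.asarray(pos_emb, dtype=np.float64)
    head, tail = pos_emb[:keep], pos_emb[keep:]
    n_tail = tail.shape[0]
    if n_tail == 0:
        return head.copy()

    # Fractional source index of every new tail position: 0, 1/ratio, 2/ratio, ...
    src = np.arange(n_tail * ratio) / ratio
    lo = np.floor(src).astype(int)
    # Clamp the upper neighbour so positions past the last row reuse the last row.
    hi = np.minimum(lo + 1, n_tail - 1)
    w = (src - lo)[:, None]

    stretched_tail = (1.0 - w) * tail[lo] + w * tail[hi]
    return np.concatenate([head, stretched_tail], axis=0)


# Usage with a random stand-in for a 77-position table (assumed width 512).
rng = np.random.default_rng(0)
original = rng.normal(size=(77, 512))
stretched = stretch_positional_embedding(original, keep=20, ratio=4)
print(stretched.shape)  # (248, 512): 20 kept rows + 57 * 4 interpolated rows
```

Under these assumptions, short inputs see exactly the same positional embeddings as before, which is one plausible way to retain the original short-text behavior while extending the usable input length.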

CogLTX: Applying BERT to Long Texts.

Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT

Long Text Classification Based on BERT

LordBERT: Embedding Long Text by Segment Ordering with BERT

Long-CLIP: Unlocking the Long-Text Capability of CLIP

Long Text Classification with Segmentation

CoT-BERT: Enhancing Unsupervised Sentence Representation through Chain-of-Thought

Improving the BERT model for long text sequences in question answering domain

Hierarchical Transformers for Long Document Classification

FAT ALBERT: Finding Answers in Large Texts using Semantic Similarity Attention Layer based on BERT

SEGMENT+: Long Text Processing with Short-Context Language Models

TTG-Text: A Graph-Based Text Representation Framework Enhanced by Typical Testors for Improved Classification

LongCoder: A Long-Range Pre-trained Language Model for Code Completion

Blockwise Self-Attention for Long Document Understanding

TRANS-BLSTM: Transformer with Bidirectional LSTM for Language Understanding

Global Semantic Information Extraction Model for Chinese long text classification based on fine-tune BERT

Fixed global memory for controllable long text generation

mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences

No Argument Left Behind: Overlapping Chunks for Faster Processing of Arbitrarily Long Legal Texts

MemLong: Memory-Augmented Retrieval for Long Text Modeling

Understanding Long Programming Languages with Structure-Aware Sparse Attention