Abstract: VAR is a new generation paradigm that employs 'next-scale prediction' as opposed to 'next-token prediction'. This innovative transformation enables auto-regressive (AR) transformers to rapidly learn visual distributions and achieve robust generalization. However, the original VAR model is constrained to class-conditioned synthesis, relying solely on textual captions for guidance. In this paper, we introduce VAR-CLIP, a novel text-to-image model that integrates Visual Auto-Regressive techniques with the capabilities of CLIP. The VAR-CLIP framework encodes captions into text embeddings, which are then utilized as textual conditions for image generation. To facilitate training on extensive datasets, such as ImageNet, we have constructed a substantial image-text dataset leveraging BLIP2. Furthermore, we delve into the significance of word positioning within CLIP for the purpose of caption guidance. Extensive experiments confirm VAR-CLIP's proficiency in generating fantasy images with high fidelity, textual congruence, and aesthetic excellence. Our project page is available at https://github.com/daixiangzi/VAR-CLIP

What problem does this paper attempt to address?

The paper proposes VAR-CLIP, a new model for high-quality text-to-image (T2I) generation. Specifically, VAR-CLIP combines Visual Auto-Regressive (VAR) modeling with Contrastive Language–Image Pre-training (CLIP) to generate images efficiently and with high quality.

### Main Contributions

1. **Proposing the VAR-CLIP Framework**: A framework for text-to-image generation that uses CLIP to obtain text embeddings, which then condition VAR during image generation. The resulting images are consistent with the text description, high in fidelity, and aesthetically strong.
2. **Constructing a Large-Scale Image-Text Dataset**: To support T2I training on large-scale datasets such as ImageNet, the authors used BLIP-2 to construct a large-scale dataset of image-text pairs.
3. **Exploring the Importance of Word Position in CLIP**: The study found that the position of words in the CLIP input has a significant impact on the generation results. In particular, the first 20 words (excluding the start and end tokens) contribute more to the overall description.

### Method Overview

- **Pre-trained Text Encoder**: CLIP converts the input text into embeddings, which serve as conditions during generation.
- **Multi-Scale Image Tokenizer**: A Multi-Scale Vector Quantized Autoencoder (Multi-Scale VQVAE) converts images into discrete multi-scale tokens for efficient generation.
- **Conditional Visual Auto-Regressive Transformer**: Conditioned on the CLIP text embeddings, the transformer predicts the tokens of the "next scale" of the image, rather than the traditional "next token."
- **Two-Stage Training Strategy**: The multi-scale VQVAE is trained first and independently; the conditional visual auto-regressive transformer is then trained on top of it.
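To make the difference between "next scale" and "next token" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each scale is a square token map and that tokens of one scale are predicted together, so a token may attend to its own scale and to all coarser scales. The function names and the tiny scale schedule `(1, 2, 3)` are hypothetical, and the CLIP text condition is left out for brevity.

```python
def scale_token_counts(side_lengths):
    """Tokens per scale: a scale with side length s holds an s x s token map."""
    return [s * s for s in side_lengths]


def block_causal_mask(side_lengths):
    """Return mask[i][j], True when token i may attend to token j.

    Assumption of this sketch: tokens of one scale are predicted together,
    so a token sees every token of its own scale and of all coarser scales,
    but nothing from finer scales.
    """
    counts = scale_token_counts(side_lengths)
    # Scale index of every position in the flattened multi-scale sequence.
    scale_of = [k for k, n in enumerate(counts) for _ in range(n)]
    total = len(scale_of)
    return [[scale_of[j] <= scale_of[i] for j in range(total)]
            for i in range(total)]


def generation_steps(side_lengths):
    """Compare the number of autoregressive steps under both schemes."""
    total = sum(scale_token_counts(side_lengths))
    return {"next_scale": len(side_lengths), "next_token": total}


if __name__ == "__main__":
    schedule = (1, 2, 3)  # hypothetical schedule: 1x1, 2x2, 3x3 token maps
    print(scale_token_counts(schedule))
    print(generation_steps(schedule))
```

With this toy schedule the sequence has 1 + 4 + 9 = 14 tokens, so next-scale prediction needs 3 autoregressive steps where next-token prediction would need 14. That gap is the source of the efficiency the summary attributes to the approach.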
### Experimental Results

- VAR-CLIP can generate high-quality images from diverse text prompts, including animals, architectural structures, and landscapes, and it captures the time of day and lighting conditions described in the text well.
- VAR-CLIP also produces failure cases with obvious artifacts in some situations, especially in fine details such as animal eyes.

### Conclusion

The paper concludes that although VAR-CLIP makes significant progress in text-to-image generation, limitations remain, such as the accuracy of the generated captions and the alignment between the text encoding model and the image generation process. Future research will focus on improving caption quality and strengthening the alignment between complex text and visual elements.

VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

ControlVAR: Exploring Controllable Visual Autoregressive Modeling

M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

STAR: Scale-wise Text-to-image generation via Auto-Regressive representations

CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

CLIP-GEN: Language-Free Training of a Text-to-Image Generator with CLIP

Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

CgT-GAN: CLIP-guided Text GAN for Image Captioning

RWKV-CLIP: A Robust Vision-Language Representation Learner

VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

How Much Can CLIP Benefit Vision-and-Language Tasks?

LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization

CLIP4Caption: CLIP for Video Caption

CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model

CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis

CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection

RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations