VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling

Qian Zhang, Xiangzi Dai, Ninghua Yang, Xiang An, Ziyong Feng, Xingyu Ren
2024-08-02
Abstract: VAR is a new generation paradigm that employs 'next-scale prediction' as opposed to 'next-token prediction'. This transformation enables auto-regressive (AR) transformers to rapidly learn visual distributions and achieve robust generalization. However, the original VAR model is constrained to class-conditioned synthesis, relying solely on class labels rather than textual captions for guidance. In this paper, we introduce VAR-CLIP, a novel text-to-image model that integrates Visual Auto-Regressive techniques with the capabilities of CLIP. The VAR-CLIP framework encodes captions into text embeddings, which are then utilized as textual conditions for image generation. To facilitate training on extensive datasets, such as ImageNet, we have constructed a substantial image-text dataset leveraging BLIP2. Furthermore, we delve into the significance of word positioning within CLIP for the purpose of caption guidance. Extensive experiments confirm VAR-CLIP's proficiency in generating fantasy images with high fidelity, textual congruence, and aesthetic excellence. Our project page is https://github.com/daixiangzi/VAR-CLIP
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper proposes a new text-to-image generation model named VAR-CLIP, aimed at high-quality text-to-image (T2I) generation. Specifically, VAR-CLIP combines Visual Auto-Regressive (VAR) modeling with Contrastive Language-Image Pre-training (CLIP) to achieve efficient and high-quality image generation.

### Main Contributions

1. **Proposing the VAR-CLIP Framework**: A framework for high-quality text-to-image generation that uses CLIP to obtain text embeddings as conditions for VAR to generate images. This allows for the generation of images that are highly consistent with the text description, of high quality, and aesthetically excellent.
2. **Constructing a Large-Scale Image-Text Dataset**: To support T2I tasks on large-scale datasets such as ImageNet, the authors used BLIP-2 to construct a large-scale image-text pair dataset.
3. **Exploring the Importance of Word Order in CLIP**: The study found that in CLIP, the position of words has a significant impact on the generation results. In particular, the first 20 words (excluding the start and end tokens) contribute more to the overall description.

### Method Overview

- **Pre-trained Text Encoder**: Using CLIP to convert input text into embedding representations, serving as conditions during the generation process.
- **Multi-Scale Image Tokenizer**: Converting images into discrete multi-scale tokens through a Multi-Scale Vector Quantized Autoencoder (Multi-Scale VQVAE) for efficient generation.
- **Conditional Visual Auto-Regressive Transformer**: Predicting the "next scale" token map of the image conditioned on CLIP text embeddings, rather than the traditional "next token."
- **Two-Stage Training Strategy**: First, training the multi-scale VQVAE independently, and then training the conditional visual auto-regressive transformer on top of it.

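
The "next scale" idea in the method overview can be illustrated with a toy sketch. This is not the authors' implementation: `make_toy_predictor`, `generate_token_maps`, the scale schedule `(1, 2, 4, 8)`, the vocabulary size, and the greedy `argmax` decoding are all assumptions chosen for illustration, and the random projection merely stands in for the conditional transformer. It only shows the control flow: each step emits a whole token map, conditioned on the text embedding and on the coarser maps produced so far.

```python
import numpy as np


def make_toy_predictor(dim, vocab_size=16, seed=0):
    """Hypothetical stand-in for the conditional AR transformer."""
    rng = np.random.default_rng(seed)
    proj = rng.normal(size=(dim, vocab_size))  # maps text embedding to vocab logits

    def predict_scale(text_embedding, coarser_maps, side):
        # A real model would attend over coarser_maps; this toy ignores them
        # and only mixes the text condition with noise.
        text_logits = text_embedding @ proj               # shape (vocab_size,)
        noise = rng.normal(size=(side, side, vocab_size))
        return text_logits + noise                        # broadcasts to (side, side, vocab_size)

    return predict_scale


def generate_token_maps(text_embedding, predict_scale, scales=(1, 2, 4, 8)):
    """Coarse-to-fine generation: one full token map per autoregressive step."""
    token_maps = []
    for side in scales:
        logits = predict_scale(text_embedding, token_maps, side)
        token_maps.append(logits.argmax(axis=-1))  # greedy decoding (an assumption)
    return token_maps


text_embedding = np.ones(8)  # placeholder for a CLIP text embedding
maps = generate_token_maps(text_embedding, make_toy_predictor(dim=8))
print([m.shape for m in maps])                                   # [(1, 1), (2, 2), (4, 4), (8, 8)]
print(sum(m.size for m in maps), "tokens in", len(maps), "steps")  # 85 tokens in 4 steps
```

With this schedule, 85 tokens are produced in 4 autoregressive steps, whereas next-token prediction would need one step per token.
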
### Experimental Results

- VAR-CLIP can generate high-quality images from diverse text prompts, including animals, architectural structures, and landscapes, and captures the time of day and lighting conditions described in the text well.
- Nevertheless, VAR-CLIP also produces failure cases with obvious artifacts in some situations, especially in fine details such as animal eyes.

### Conclusion

The paper concludes that although VAR-CLIP has made significant progress in text-to-image generation, some limitations remain, such as the precision of the generated captions and alignment issues between the text encoding model and the image generation process. Future research will focus on improving caption quality and enhancing the alignment between complex text and visual elements.