STAR: Scale-wise Text-to-image generation via Auto-Regressive representations

Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Huaian Chen, Yi Jin

2024-06-16

Abstract: We present STAR, a text-to-image model that employs a scale-wise auto-regressive paradigm. Unlike VAR, which is limited to class-conditioned synthesis within a fixed set of predetermined categories, STAR enables text-driven open-set generation through three key designs. First, to boost diversity and generalizability with unseen combinations of objects and concepts, we introduce a pre-trained text encoder to extract representations for textual constraints, which we then use as guidance. Second, to improve the interactions between generated images and fine-grained textual guidance, making results more controllable, additional cross-attention layers are incorporated at each scale. Third, given the natural structural correlation across different scales, we leverage 2D Rotary Positional Encoding (RoPE) and tweak it into a normalized version. This ensures consistent interpretation of relative positions across token maps at different scales and stabilizes the training process. Extensive experiments demonstrate that STAR surpasses existing benchmarks in terms of fidelity, image-text consistency, and aesthetic quality. Our findings emphasize the potential of auto-regressive methods in the field of high-quality image synthesis, offering promising new directions for the T2I field currently dominated by diffusion methods.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to address several key issues in the field of Text-to-Image (T2I) generation:

1. **Open-Set Generation**: Existing class-conditional synthesis methods (e.g., VAR) are limited to generation within predefined category sets, which restricts diversity and the ability to generalize to new objects or concepts. The paper proposes a new autoregressive model, STAR, which achieves open-set generation by introducing a pre-trained text encoder to extract representations of the text constraints.
2. **Fine-Grained Control**: To strengthen the interaction between generated images and text prompts and make the results more controllable, STAR incorporates additional cross-attention layers at each scale. This allows the model to better capture detailed information in complex scenes, such as the relationships and counts of multiple objects.
3. **Positional Encoding Improvement**: The paper proposes a normalized rotary positional encoding (Normalized RoPE) to address the parameter redundancy and optimization difficulty of traditional absolute positional encodings (APEs) across different scales. Normalized RoPE ensures that relative positions are interpreted consistently across scales and stabilizes the training process.

With these designs, STAR surpasses existing methods on multiple benchmarks in fidelity (FID), text-image consistency (CLIP-Score), and aesthetic quality, while significantly reducing inference time. These improvements offer new directions for high-quality image synthesis and showcase the potential of autoregressive methods in the T2I field, which currently relies heavily on diffusion models.
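The normalized positional encoding idea can be illustrated with a small sketch. This is not the authors' implementation: the summary does not give the exact normalization, so the code below assumes that each token map's row and column indices are rescaled to cell centers on a shared grid of hypothetical width `ref`, and that standard RoPE pair rotations are applied with row positions on the first half of the channels and column positions on the second half. The function names, `ref`, and `theta` are all illustrative assumptions.

```python
import math


def normalized_coords(size, ref=16.0):
    """Cell-center coordinates of a `size`-wide token map, rescaled to [0, ref).

    Assumption: every scale is mapped onto the same reference grid, so the
    same spatial location gets the same coordinate at every scale.
    """
    return [(i + 0.5) / size * ref for i in range(size)]


def rotate_pairs(vec, pos, theta=10000.0):
    """Standard RoPE: rotate channel pairs (k, k+1) by angle pos * theta^(-k/d)."""
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        ang = pos * theta ** (-k / d)
        c, s = math.cos(ang), math.sin(ang)
        out += [vec[k] * c - vec[k + 1] * s, vec[k] * s + vec[k + 1] * c]
    return out


def normalized_rope_2d(tokens, size, ref=16.0):
    """Apply 2D RoPE with scale-normalized positions.

    tokens: row-major list of size*size vectors (channel count divisible by 4).
    The first half of each vector is rotated by the row coordinate, the
    second half by the column coordinate.
    """
    coords = normalized_coords(size, ref)
    out = []
    for idx, vec in enumerate(tokens):
        row, col = divmod(idx, size)
        half = len(vec) // 2
        out.append(
            rotate_pairs(vec[:half], coords[row])
            + rotate_pairs(vec[half:], coords[col])
        )
    return out
```

Under this assumed normalization, the single token of a 1x1 map and the center token of a 3x3 map share the same coordinate (`ref / 2`), which is the kind of cross-scale consistency the summary describes. Because the encoding is a pure rotation, it adds no learned parameters and leaves each vector's norm unchanged.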

