A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Liang Chen, Sinan Tan, Zefan Cai, Weichu Xie, Haozhe Zhao, Yichi Zhang, Junyang Lin, Jinze Bai, Tianyu Liu, Baobao Chang

2024-10-03

Abstract: This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, *model depth*, along with the sequence length direction. Compared to traditional 1D autoregression and previous work utilizing similar 2D image decomposition such as RQ-Transformer, the DnD-Transformer is an end-to-end model that can generate higher quality images with the same backbone model size and sequence length, opening a new optimization perspective for autoregressive image generation. Furthermore, our experiments reveal that the DnD-Transformer's potential extends beyond generating natural images. It can even generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities. This has not been previously demonstrated for popular vision generative models such as diffusion models, showing a spark of vision-language intelligence when trained solely on images. Code, datasets and models are open at https://github.com/chenllliang/DnD-Transformer.

Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address the bottleneck of information loss in vector-quantization (VQ) autoregressive image generation. Specifically, traditional 1D autoregressive models face two main challenges when generating high-quality images:

1. **Information Loss**: The vector quantization step, especially in VQ-VAE, discards a large amount of information. For example, in a typical configuration (N=8192, f=16), the information compression ratio (ICR) is only 0.21%, far lower than the 8.3% of Stable Diffusion's VAE, which limits the reconstruction of fine-grained details.
2. **Increased Computational Resource Demand**: Raising image quality by increasing the codebook size (N) or reducing the downsampling factor (f) substantially increases computational resource demand, and may result in codebook collapse or higher computational complexity.

To address these issues, the authors propose a new model architecture, the 2-Dimensional Autoregression Transformer (DnD-Transformer). The model introduces a new autoregressive direction (the depth direction) to predict more codes per image, improving generation quality without increasing the backbone model size or sequence length. The DnD-Transformer can generate higher-resolution and more fine-grained images, and it performs well on images containing rich text and graphical elements, showing initial signs of vision-language intelligence.
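The two ICR figures quoted above can be reproduced with a back-of-the-envelope calculation. The sketch below is not taken from the paper's code. It assumes a 256×256 RGB image at 8 bits per channel, that each VQ token carries log2(N) bits, and that Stable Diffusion's VAE latent has 4 channels at f=8 stored as 32-bit floats. The function names and the `depth` parameter are hypothetical; `depth` only illustrates how predicting several codes per position would scale the ratio while the sequence length stays fixed.

```python
import math


def image_bits(height=256, width=256, channels=3, bits_per_channel=8):
    """Raw size of the input image in bits (assumed 8-bit RGB)."""
    return height * width * channels * bits_per_channel


def vq_icr(codebook_size, downsample, depth=1, height=256, width=256):
    """Information compression ratio of a VQ tokenizer.

    Each spatial position holds `depth` codes, and each code is assumed
    to carry log2(codebook_size) bits.
    """
    tokens = (height // downsample) * (width // downsample)
    code_bits = tokens * depth * math.log2(codebook_size)
    return code_bits / image_bits(height, width)


def continuous_icr(downsample, latent_channels, bits_per_value,
                   height=256, width=256):
    """Same ratio for a continuous latent (e.g. a VAE feature map)."""
    values = (height // downsample) * (width // downsample) * latent_channels
    return values * bits_per_value / image_bits(height, width)


print(f"VQ, N=8192, f=16:          {vq_icr(8192, 16):.2%}")           # 0.21%
print(f"VAE, f=8, 4 ch, fp32:      {continuous_icr(8, 4, 32):.2%}")   # 8.33%
print(f"VQ, N=8192, f=16, depth=2: {vq_icr(8192, 16, depth=2):.2%}")  # 0.42%
```

Under these assumptions the ratio grows linearly with the number of codes per position but only logarithmically with N, which is consistent with the summary's point that enlarging the codebook alone is an expensive way to recover detail.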
