A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Liang Chen, Sinan Tan, Zefan Cai, Weichu Xie, Haozhe Zhao, Yichi Zhang, Junyang Lin, Jinze Bai, Tianyu Liu, Baobao Chang
2024-10-03
Abstract: This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer. The DnD-Transformer predicts more codes for an image by introducing a new autoregression direction, model depth, along with the sequence length direction. Compared to traditional 1D autoregression and previous work utilizing similar 2D image decomposition such as RQ-Transformer, the DnD-Transformer is an end-to-end model that can generate higher quality images with the same backbone model size and sequence length, opening a new optimization perspective for autoregressive image generation. Furthermore, our experiments reveal that the DnD-Transformer's potential extends beyond generating natural images. It can even generate images with rich text and graphical elements in a self-supervised manner, demonstrating an understanding of these combined modalities. This has not been previously demonstrated for popular vision generative models such as diffusion models, showing a spark of vision-language intelligence when trained solely on images. Code, datasets and models are open at https://github.com/chenllliang/DnD-Transformer.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

The paper addresses the information loss bottleneck in vector-quantization (VQ) autoregressive image generation. Specifically, traditional 1D autoregressive models face two main challenges when generating high-quality images:

1. **Information Loss**: Vector quantization, as used in VQ-VAE, discards a large share of the image information. For example, in a typical configuration (codebook size N=8192, downsampling factor f=16), the information compression ratio (ICR) is only 0.21%, far lower than the 8.3% of Stable Diffusion's VAE, which limits the reconstruction of fine-grained details.
2. **Increased Computational Resource Demand**: The two direct ways to retain more information both carry a cost. Increasing the codebook size (N) may result in codebook collapse, while reducing the downsampling factor (f) lengthens the token sequence and therefore raises computational complexity.

To address these issues, the authors propose a new model architecture, the 2D Autoregressive Transformer (DnD-Transformer). The model introduces a new autoregressive direction (model depth) to predict more codes per image, improving generation quality with the same backbone model size and sequence length. The DnD-Transformer is capable of generating higher resolution and more fine-grained images, and it performs well on images containing rich text and graphical elements, showing initial signs of vision-language intelligence.
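The ICR figures quoted above can be reproduced with a short calculation. The sketch below is a minimal illustration, not the authors' code. It assumes that ICR is the number of bits stored in the latent representation divided by the number of bits in the corresponding RGB pixels (24 bits per pixel), and that the Stable Diffusion VAE figure refers to 4 latent channels stored as 32-bit floats at f=8. The function names and the `depth` parameter (the number of codes predicted per spatial position) are hypothetical names introduced here for illustration.

```python
import math

BITS_PER_RGB_PIXEL = 24  # assumption: 3 channels x 8 bits


def vq_icr(codebook_size: int, downsample: int, depth: int = 1) -> float:
    """ICR of a VQ tokenizer (assumed definition: latent bits / pixel bits).

    Each code carries log2(codebook_size) bits and stands for a
    downsample x downsample patch of RGB pixels. `depth` is the number
    of codes kept per spatial position (1 for plain 1D autoregression).
    """
    latent_bits = depth * math.log2(codebook_size)
    pixel_bits = downsample * downsample * BITS_PER_RGB_PIXEL
    return latent_bits / pixel_bits


def continuous_icr(channels: int, bits_per_value: int, downsample: int) -> float:
    """ICR of a continuous latent, e.g. a VAE with float-valued channels."""
    latent_bits = channels * bits_per_value
    pixel_bits = downsample * downsample * BITS_PER_RGB_PIXEL
    return latent_bits / pixel_bits


if __name__ == "__main__":
    print(f"VQ, N=8192, f=16: {vq_icr(8192, 16):.2%}")
    print(f"Continuous, 4 x float32, f=8: {continuous_icr(4, 32, 8):.1%}")
    print(f"VQ, N=8192, f=16, depth=2: {vq_icr(8192, 16, depth=2):.2%}")
```

Under these assumptions the first two calls give roughly 0.21% and 8.3%, matching the figures in the text. The last call shows why a depth direction is attractive: keeping several codes per position scales the ICR linearly with `depth` while f, and therefore the sequence length, stays unchanged.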