Abstract: In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which allows us to tokenize the visual details of an image from coarse to fine by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this 'wavelet language'. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning on the generation process.

What problem does this paper attempt to address?

This paper proposes a new method for autoregressive image generation based on wavelet image coding. Although established methods such as diffusion models can generate high-resolution images, the authors argue that the generation process can be improved. They adopt wavelet image coding to tokenize the visual information of an image in coarse-to-fine order, using a vocabulary of only 7 tokens. The second key ingredient is a language transformer modified for token sequences in this "wavelet language"; it learns the statistical correlations within a sequence, which reflect the known correlations between wavelet subbands at different resolutions.

During generation, the authors use progressive wavelet compression to build a token sequence in which each token carries the next most important piece of visual information. The sequence length is therefore flexible and controls the level of detail or resolution of the image. Conditional image generation is achieved by guiding the generation process (e.g., with class labels or text prompts), and simple transformer inference techniques control the degree of randomness, producing diverse images from the same text prompt. The paper also discusses how the local support of wavelets can be used to modify the guidance during generation, so that different prompts apply to different regions of the image. Compared with other methods such as VQGAN and DALL-E, the approach uses a much smaller token vocabulary (only 7 tokens), yet can generate images of arbitrary resolution and fidelity.

The paper reviews related work, including diffusion models, autoregressive models, and methods that use wavelets as the basis for frequency decomposition. The authors also give a detailed introduction to the elements of wavelet image coding: the wavelet transform, the embedded wavelet tokenization process, and how to decode a token sequence back into an approximate wavelet representation.

Finally, the paper describes how the authors modified the DistilGPT2 model for the wavelet image generation task and presents experimental results to validate the method. With this approach, they leverage the structural and statistical properties of wavelet coding to generate visually meaningful images.
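
The coarse-to-fine, most-significant-bit-first ordering described above can be illustrated with a small sketch. This is not the authors' tokenizer: it assumes a 1D Haar transform instead of a 2D image wavelet, a plain bit-plane scan, and a hypothetical 5-token vocabulary (`ZERO`, `POS`, `NEG`, `BIT0`, `BIT1`) rather than the paper's 7 tokens. All function names are made up for illustration. The point it shows is that any prefix of the token sequence decodes to a coarser approximation of the same signal.

```python
import math

def haar_forward(signal, levels):
    """Multi-level 1D Haar transform; output is [coarse | details, coarse to fine]."""
    coeffs = [float(v) for v in signal]
    n = len(coeffs)
    for _ in range(levels):
        half = n // 2
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / 2 for i in range(half)]
        diff = [(coeffs[2 * i] - coeffs[2 * i + 1]) / 2 for i in range(half)]
        coeffs[:n] = avg + diff
        n = half
    return coeffs

def haar_inverse(coeffs, levels):
    """Inverse of haar_forward."""
    out = list(coeffs)
    n = len(out) >> levels  # length of the coarsest approximation
    for _ in range(levels):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged += [a + d, a - d]
        out[:2 * n] = merged
        n *= 2
    return out

def encode_bitplanes(coeffs, passes):
    """Emit tokens pass by pass, halving the threshold after each pass."""
    max_abs = max(abs(c) for c in coeffs)
    # Assumption: start at the largest power of two not exceeding max |c|.
    t0 = 2.0 ** math.floor(math.log2(max_abs)) if max_abs > 0 else 1.0
    significant = [False] * len(coeffs)
    low = [0.0] * len(coeffs)  # known lower bound on each magnitude
    tokens, t = [], t0
    for _ in range(passes):
        for i, c in enumerate(coeffs):
            if significant[i]:
                # Refinement: one more bit of an already significant magnitude.
                bit = abs(c) >= low[i] + t
                tokens.append("BIT1" if bit else "BIT0")
                if bit:
                    low[i] += t
            elif abs(c) >= t:
                # Coefficient becomes significant: emit its sign.
                tokens.append("POS" if c > 0 else "NEG")
                significant[i], low[i] = True, t
            else:
                tokens.append("ZERO")
        t /= 2
    return t0, tokens

def decode_bitplanes(tokens, n, t0):
    """Rebuild approximate coefficients from any prefix of the token sequence."""
    sign, low, width = [0] * n, [0.0] * n, [0.0] * n
    t, stream = t0, iter(tokens)
    try:
        while True:
            for i in range(n):
                tok = next(stream)
                if tok in ("POS", "NEG"):
                    sign[i] = 1 if tok == "POS" else -1
                    low[i], width[i] = t, t
                elif tok in ("BIT0", "BIT1"):
                    width[i] = t
                    if tok == "BIT1":
                        low[i] += t
            t /= 2
    except StopIteration:
        pass  # prefix exhausted: keep whatever precision was reached
    # Place each magnitude at the midpoint of its uncertainty interval.
    return [s * (l + w / 2) for s, l, w in zip(sign, low, width)]

pixels = [52, 55, 61, 66, 70, 61, 64, 73]
coeffs = haar_forward(pixels, levels=3)
t0, tokens = encode_bitplanes(coeffs, passes=4)
for k in (1, 8, len(tokens)):
    approx = haar_inverse(decode_bitplanes(tokens[:k], len(coeffs), t0), levels=3)
    print(k, [round(v, 1) for v in approx])
```

Truncating `tokens` at different lengths `k` mimics the flexible sequence length mentioned in the summary: short prefixes recover only a flat, coarse version of the signal, while longer prefixes add detail coefficients and refinement bits.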

Wavelets Are All You Need for Autoregressive Image Generation

Spectral Image Tokenizer

CART: Compositional Auto-Regressive Transformer for Image Generation

Generalized Rectifier Wavelet Covariance Models For Texture Synthesis

Autoregressive Image Generation using Residual Quantization

Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization

Autoregressive Image Generation without Vector Quantization

Randomized Autoregressive Visual Generation

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Wavelet-Based Image Tokenizer for Vision Transformers

A Novel Embedded Coding Algorithm Based on the Reconstructed DCT Coefficients

3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes

Joint Wavelet and Spatial Transformation for Digital Watermarking

Image coding using wavelet transform

Wavelet-Packets for Deepfake Image Analysis and Detection

Wavelet Networks: Scale-Translation Equivariant Learning From Raw Time-Series

Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

Time Series Synthesis via Multi-scale Patch-based Generation of Wavelet Scalogram

Wavelets to the Rescue: Improving Sample Quality of Latent Variable Deep Generative Models

Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation

Inducing wavelets into random fields via generative boosting