Abstract: Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and autoregressive (AR) models, as they construct the latent representation for modeling. Increasing token length is a common approach to improve the image reconstruction quality. However, tokenizers with longer token lengths are not guaranteed to achieve better generation quality. There exists a trade-off between reconstruction and generation quality regarding token length. In this paper, we investigate the impact of token length on both image reconstruction and generation and provide a flexible solution to the trade-off. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling to improve both generation efficiency and quality. To enhance the representative capability without increasing token length, we leverage dual-branch product quantization to capture different contexts of images. Specifically, semantic regularization is introduced in one branch to encourage compacted semantic information while another branch is designed to capture the remaining pixel-level details. Extensive experiments demonstrate the superior quality of image generation and shorter token length with ImageFolder tokenizer.

What problem does this paper attempt to address?

This paper addresses the trade-off between reconstruction quality and generation quality in the image tokenizers used for image generation. Specifically:

1. **Token length versus performance**: Increasing the token length generally improves image reconstruction quality. However, an overly long token sequence degrades generation in an autoregressive (AR) model, because a longer sequence requires more model capacity, costs more to train, and is prone to error propagation.
2. **Balancing semantic and detail information**: Existing methods either retain too many pixel-level details, which results in a large number of tokens, or compress the information too aggressively, which loses both semantics and detail. The challenge is therefore to reduce the token length and improve generation efficiency while maintaining high-quality reconstruction.

To address these problems, the authors propose a new image tokenizer named **ImageFolder**, whose main features are:

- **Dual-branch product quantization**: Two branches capture different information about the image. One branch captures semantic information, and the other captures pixel-level details. This enhances the representation ability without increasing the token length.
- **Folded tokens**: During autoregressive modeling, the spatially aligned tokens are folded so that two tokens are predicted in parallel from one logit, which significantly shortens the sequence length and improves generation efficiency.
- **Semantic regularization**: A semantic regularization term is introduced in the semantic branch so that its tokens capture compact semantic information, while high-frequency details are not ignored.
- **Quantizer dropout**: Some quantizers are randomly dropped in multi-scale residual quantization, so that each residual layer can represent the image at a different bit rate, thereby compensating for the unmodeled dependencies in the subsequent autoregressive modeling.

These designs enable ImageFolder to shorten the token length and improve generation efficiency while maintaining high-quality reconstruction. Experimental results show that ImageFolder outperforms existing methods on multiple metrics, especially when the token length is relatively short.
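
To make the dual-branch quantization and token folding described above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`nearest_code`, `product_quantize`, `unfold`), the toy codebooks, and the choice to split each latent vector into two equal halves are all assumptions made for illustration. The sketch only shows why spatially aligned token pairs let an AR model take one step per position instead of one step per token.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def product_quantize(latents, semantic_codebook, detail_codebook):
    """Quantize each latent with two codebooks, one per branch.

    Assumption for illustration: the first half of each latent vector feeds
    the semantic branch and the second half feeds the detail branch.
    Returns one (semantic_id, detail_id) pair per spatial position, so the
    two token streams stay spatially aligned.
    """
    pairs = []
    for vec in latents:
        half = len(vec) // 2
        semantic_id = nearest_code(vec[:half], semantic_codebook)
        detail_id = nearest_code(vec[half:], detail_codebook)
        pairs.append((semantic_id, detail_id))
    return pairs


def unfold(pairs):
    """Flatten the pairs into the sequence an AR model sees without folding."""
    return [token for pair in pairs for token in pair]


# Toy data (hypothetical): two spatial positions, 4-dimensional latents.
semantic_codebook = [[0.0, 0.0], [1.0, 1.0]]
detail_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
latents = [[0.9, 1.1, 0.1, 0.8], [0.1, -0.2, 0.9, 0.1]]

folded = product_quantize(latents, semantic_codebook, detail_codebook)
unfolded = unfold(folded)

# Folded: one AR step per position, each step emitting both tokens.
print("folded steps:", len(folded))      # 2 steps
# Unfolded: one AR step per token, so the sequence is twice as long.
print("unfolded steps:", len(unfolded))  # 4 steps
```

In this toy setting the folded sequence has half as many autoregressive steps as the unfolded one, while each step still carries both a semantic and a detail token. How the two tokens are actually decoded from one logit, and how the branches are trained, is specified in the paper rather than in this sketch.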

ImageFolder: Autoregressive Image Generation with Folded Tokens

Adaptive Length Image Tokenization via Recurrent Allocation

XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

Emage: Non-Autoregressive Text-to-Image Generation

An Image is Worth 32 Tokens for Reconstruction and Generation

PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

Image Understanding Makes for A Good Tokenizer for Image Generation

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

Autoregressive Image Generation without Vector Quantization

Factorized Visual Tokenization and Generation

Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

Language-Guided Image Tokenization for Generation

Randomized Autoregressive Visual Generation

Make A Long Image Short: Adaptive Token Length for Vision Transformers

3D representation in 512-Byte:Variational tokenizer is the key for autoregressive 3D generation

Regularized Vector Quantization for Tokenized Image Synthesis

Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes

Spectral Image Tokenizer