ImageFolder: Autoregressive Image Generation with Folded Tokens

Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Bhiksha Raj, Zhe Lin
2024-10-16
Abstract: Image tokenizers are crucial for visual generative models, e.g., diffusion models (DMs) and autoregressive (AR) models, as they construct the latent representation for modeling. Increasing token length is a common approach to improve the image reconstruction quality. However, tokenizers with longer token lengths are not guaranteed to achieve better generation quality. There exists a trade-off between reconstruction and generation quality regarding token length. In this paper, we investigate the impact of token length on both image reconstruction and generation and provide a flexible solution to the trade-off. We propose ImageFolder, a semantic tokenizer that provides spatially aligned image tokens that can be folded during autoregressive modeling to improve both generation efficiency and quality. To enhance the representative capability without increasing token length, we leverage dual-branch product quantization to capture different contexts of images. Specifically, semantic regularization is introduced in one branch to encourage compacted semantic information while another branch is designed to capture the remaining pixel-level details. Extensive experiments demonstrate the superior quality of image generation and shorter token length with ImageFolder tokenizer.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the trade-off between reconstruction quality and generation quality in image tokenizers for image generation. Specifically:

1. **Token length versus performance**: Increasing the token length generally improves image reconstruction quality. However, an overly long token sequence degrades generation in an autoregressive (AR) model, because a longer sequence demands more model capacity and higher training cost, and is more prone to error propagation.
2. **Semantic information versus detail information**: Existing methods either retain too many pixel-level details, which results in a large number of tokens, or compress the information too aggressively, which loses both semantics and detail. The challenge is therefore to reduce the token length and improve generation efficiency while maintaining high-quality reconstruction.

To solve these problems, the authors propose a new image tokenizer named **ImageFolder**, whose main features include:

- **Dual-branch product quantization**: Two branches capture different information from the image. One branch captures semantic information, and the other captures pixel-level details. This enhances the representation capability without increasing the token length.
- **Folded tokens**: During autoregressive modeling, the spatially aligned tokens from the two branches are folded so that two tokens can be predicted in parallel from one logit, which significantly shortens the sequence and improves generation efficiency.
- **Semantic regularization**: A semantic regularization term is introduced in the semantic branch so that its tokens capture compact semantic information, while the other branch keeps the high-frequency details.
- **Quantizer dropout**: Some quantizers are randomly dropped during multi-scale residual quantization, so that each residual layer can represent the image at a different bit rate, which compensates for dependencies left unmodeled in the subsequent autoregressive modeling.

These designs enable ImageFolder to shorten the token length and improve generation efficiency while maintaining high-quality reconstruction. Experimental results show that ImageFolder outperforms existing methods on multiple metrics, especially when the token length is relatively short.
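The interplay between product quantization and token folding can be illustrated with a small NumPy sketch. This is not the authors' implementation: the random codebooks, the channel-wise split of one latent into a "semantic" half and a "detail" half, and the helper names `nearest_code`, `product_quantize`, `fold`, and `unfold` are all assumptions made for illustration. It only shows why two spatially aligned token streams can share one sequence position.

```python
import numpy as np

def nearest_code(x, codebook):
    """Return, for each row of x, the index of the closest codebook entry."""
    # x: (N, d), codebook: (K, d) -> squared distances of shape (N, K)
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def product_quantize(latents, semantic_codebook, detail_codebook):
    """Quantize two sub-vectors of each latent with separate codebooks.

    Assumption: the two branches are modeled here as the two channel halves
    of a single latent; the paper's branches need not be built this way.
    """
    half = latents.shape[1] // 2
    sem_idx = nearest_code(latents[:, :half], semantic_codebook)
    det_idx = nearest_code(latents[:, half:], detail_codebook)
    return sem_idx, det_idx

def fold(sem_idx, det_idx):
    """Pair spatially aligned tokens so one position carries both indices."""
    return np.stack([sem_idx, det_idx], axis=1)  # shape (N, 2)

def unfold(folded, semantic_codebook, detail_codebook):
    """Look up both codes and concatenate them into one latent per position."""
    sem = semantic_codebook[folded[:, 0]]
    det = detail_codebook[folded[:, 1]]
    return np.concatenate([sem, det], axis=1)

# Hypothetical sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
num_positions, dim, codebook_size = 16, 8, 32
latents = rng.normal(size=(num_positions, dim))
semantic_codebook = rng.normal(size=(codebook_size, dim // 2))
detail_codebook = rng.normal(size=(codebook_size, dim // 2))

sem_idx, det_idx = product_quantize(latents, semantic_codebook, detail_codebook)
folded = fold(sem_idx, det_idx)
reconstructed = unfold(folded, semantic_codebook, detail_codebook)

# Unfolded, an AR model would see 2 * num_positions tokens; folded, it sees
# num_positions steps, each of which emits a (semantic, detail) pair.
print(folded.shape, reconstructed.shape)
```

In this sketch the sequence an autoregressive model has to traverse stays at `num_positions` steps even though each position is described by two codebook indices, which is the efficiency argument the summary makes for folding.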