Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Fengyuan Shi, Zhuoyan Luo, Yixiao Ge, Yujiu Yang, Ying Shan, Limin Wang
2024-12-04
Abstract: Existing vector quantization (VQ) methods struggle with scalability, largely attributed to the instability of the codebook that undergoes partial updates during training. The codebook is prone to collapse as utilization decreases, due to the progressively widening distribution gap between non-activated codes and visual features. To solve the problem, we propose Index Backpropagation Quantization (IBQ), a new VQ method for the joint optimization of all codebook embeddings and the visual encoder. Applying a straight-through estimator on the one-hot categorical distribution between the encoded feature and codebook, all codes are differentiable and maintain a consistent latent space with the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook ($2^{18}$) with high dimension ($256$) and high utilization. Experiments on the standard ImageNet benchmark demonstrate the scalability and superiority of IBQ, achieving competitive results on both reconstruction ($1.00$ rFID) and autoregressive visual generation ($2.05$ gFID). The code and models are available at https://github.com/TencentARC/SEED-Voken.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses codebook collapse in large-scale visual tokenizers, which it attributes to the partial-update strategy of existing vector quantization (VQ) methods: only the codes selected in a given step are updated. As the codebook size and embedding dimension grow, the distribution gap between non-activated codes and the visual encoder's representation space widens, so codebook utilization drops and the quality and stability of visual generation suffer. To solve this, the paper proposes Index Backpropagation Quantization (IBQ), a VQ method that updates the entire codebook in every backward pass. This keeps the codebook distribution consistent with the visual encoder and enables training of large-scale visual tokenizers with high utilization.

### Main Contributions

1. **Propose IBQ**: A new vector quantization method that updates all codes in the codebook globally, keeping the codebook distribution consistent with the visual encoder and thereby avoiding codebook collapse.
2. **Study the Scalability of IBQ**: The scalability of IBQ is verified by increasing the codebook size, code dimension, and model size. IBQ achieves, for the first time, an extremely large codebook (262,144 codes) with high dimension (256) and high utilization.
3. **Construct a Series of Autoregressive Image Generation Models**: Based on the IBQ tokenizer, autoregressive image generation models ranging from 300M to 2.1B parameters are built, and the paper reports that they outperform the methods it compares against.

### Solutions

- **Index Backpropagation Quantization (IBQ)**: All codes in the codebook are updated in every backward pass, instead of only the selected codes, which keeps the codebook distribution consistent with the visual features. Specifically, a straight-through estimator is applied between the visual features and all codebook embeddings, making all codes differentiable.
- **Dual-Quantization Loss**: A dual-quantization loss pulls the selected code embeddings and the given visual features toward each other, further improving quantization accuracy.
- **Model Expansion**: The performance of IBQ at different scales is verified by increasing the number of model layers and the code dimension.

### Experimental Results

- **Reconstruction Performance**: On ImageNet, IBQ achieves an rFID of 1.37 with 16,384 codes of dimension 256, which is reported as better than the compared methods. When the codebook size is increased to 262,144, the rFID drops further to 1.00.
- **Generation Performance**: The IBQ-based autoregressive models perform well at all tested scales. At 2.1B parameters, the model reaches a gFID of 2.05 and an IS of 286.73, which the paper reports as better than the diffusion models and other autoregressive variants it compares against.

### Conclusion

By proposing IBQ, the paper addresses the codebook collapse of existing vector quantization methods in large-scale visual generation and achieves large-scale visual tokenizer training with high utilization. The experimental results show that IBQ performs well in both reconstruction and generation and that it scales well.
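
The index backpropagation step described under Solutions can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it assumes dot-product logits between the feature and each code, a plain softmax, and a hand-derived backward pass in place of an autograd framework. The function names (`ibq_forward`, `ibq_codebook_grads`, `utilization`) and the toy values are hypothetical. The point it shows is that, with a straight-through index of the form hard one-hot + probabilities - stop_gradient(probabilities), every code receives a gradient through the logits, not only the selected one.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ibq_forward(z, codebook):
    """Return (selected index, soft probabilities, quantized vector)."""
    # Assumption: similarity is the dot product between feature and code.
    logits = [dot(z, c) for c in codebook]
    probs = softmax(logits)
    k_star = max(range(len(probs)), key=lambda k: probs[k])
    # The straight-through index equals the hard one-hot in the forward
    # pass, so the quantized vector is simply the selected code.
    z_q = list(codebook[k_star])
    return k_star, probs, z_q


def ibq_codebook_grads(z, codebook, grad_zq):
    """Gradient of the loss w.r.t. every code, given grad_zq = dL/dz_q."""
    k_star, probs, _ = ibq_forward(z, codebook)
    # dL/d(index_k) = grad_zq . c_k, passed straight through to probs_k.
    a = [dot(grad_zq, c) for c in codebook]
    mean_a = dot(probs, a)
    # Softmax backward: dL/dlogit_k = p_k * (a_k - sum_j p_j * a_j).
    grad_logits = [p * (a_k - mean_a) for p, a_k in zip(probs, a)]
    grads = []
    for k in range(len(codebook)):
        # Path through the logits (z . c_k): reaches every code.
        g = [grad_logits[k] * z_i for z_i in z]
        # Path through z_q = sum_k index_k * c_k: selected code only.
        if k == k_star:
            g = [g_i + gz_i for g_i, gz_i in zip(g, grad_zq)]
        grads.append(g)
    return grads


def utilization(indices, codebook_size):
    """Fraction of codes selected at least once."""
    return len(set(indices)) / codebook_size


# Toy example with a hypothetical upstream gradient.
z = [1.0, 0.5]
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
k_star, probs, z_q = ibq_forward(z, codebook)
grads = ibq_codebook_grads(z, codebook, grad_zq=[0.3, -0.2])
print(k_star, z_q)
print(all(any(abs(x) > 1e-9 for x in g) for g in grads))
```

In a partial-update scheme only the selected code would get a nonzero gradient. Here the non-selected codes are also moved, along the direction of the feature `z` and scaled by their softmax gradient, which is one way to read the summary's claim that the codebook stays consistent with the encoder's feature distribution.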