Abstract: Existing vector quantization (VQ) methods struggle with scalability, largely attributed to the instability of the codebook that undergoes partial updates during training. The codebook is prone to collapse as utilization decreases, due to the progressively widening distribution gap between non-activated codes and visual features. To solve the problem, we propose Index Backpropagation Quantization (IBQ), a new VQ method for the joint optimization of all codebook embeddings and the visual encoder. Applying a straight-through estimator on the one-hot categorical distribution between the encoded feature and codebook, all codes are differentiable and maintain a consistent latent space with the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook ($2^{18}$) with high dimension ($256$) and high utilization. Experiments on the standard ImageNet benchmark demonstrate the scalability and superiority of IBQ, achieving competitive results on both reconstruction ($1.00$ rFID) and autoregressive visual generation ($2.05$ gFID). The code and models are available at https://github.com/TencentARC/SEED-Voken.
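
One way to write the straight-through step described in the abstract is sketched below. The symbols ($z$ for an encoded feature, $C_1, \dots, C_K$ for the code embeddings, $\mathrm{sg}[\cdot]$ for stop-gradient) and the choice of dot-product logits are assumptions made for illustration, not notation taken from the text above.

$$
\begin{aligned}
p &= \mathrm{softmax}\big(z^\top C_1, \dots, z^\top C_K\big), \\
h &= \mathrm{onehot}\big(\arg\max_k p_k\big), \\
\mathrm{Ind} &= h - \mathrm{sg}[p] + p, \\
z_q &= \sum_{k=1}^{K} \mathrm{Ind}_k \, C_k .
\end{aligned}
$$

In the forward pass $\mathrm{Ind}$ evaluates to $h$, so $z_q$ is exactly the selected code. In the backward pass the gradient with respect to $\mathrm{Ind}$ is passed to $p$, and through the softmax it reaches every $C_k$ as well as $z$.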

What problem does this paper attempt to address?

The paper addresses codebook collapse in large-scale visual tokenizers, which it attributes to the partial-update strategy of existing vector quantization (VQ) methods. As the codebook size and embedding dimension grow, the distribution gap between non-activated codes and the visual encoder's feature space widens, so codebook utilization drops and the quality and stability of visual generation suffer. To address this, the paper proposes Index Backpropagation Quantization (IBQ), which updates the entire codebook on every backward pass so that the codebook stays consistent with the encoder's feature distribution, enabling large-scale visual tokenizer training with high utilization.

### Main Contributions

1. **Propose IBQ**: A new vector quantization method that keeps the codebook and the visual encoder distributionally consistent by updating all codes in the codebook, thereby avoiding codebook collapse.
2. **Study the Scalability of IBQ**: Scaling the codebook size, code dimension, and model size verifies the scalability of IBQ. IBQ achieves, for the first time, an extremely large codebook (262,144 codes, i.e. $2^{18}$) with high dimension (256) and high utilization.
3. **Construct a Series of Autoregressive Image Generation Models**: Autoregressive image generation models ranging from 300M to 2.1B parameters are built on the IBQ tokenizer and are reported to significantly outperform existing methods.

### Solutions

- **Index Backpropagation Quantization (IBQ)**: All codes in the codebook are updated on every backward pass, rather than only the selected codes, which keeps the codebook consistent with the distribution of visual features. Specifically, a straight-through estimator is applied between the visual features and all codebook embeddings, making all codes differentiable.
- **Dual-Quantization Loss**: A dual-quantization loss pulls the selected code embeddings and the corresponding visual features toward each other, further improving quantization accuracy.
- **Model Expansion**: Increasing the number of layers and the code dimension of the model verifies that IBQ improves at different scales.

### Experimental Results

- **Reconstruction Performance**: On ImageNet, IBQ achieves an rFID of 1.37 with 16,384 codes of dimension 256, which is reported as significantly better than the compared methods. When the codebook size is increased to 262,144, the rFID drops further to 1.00.
- **Generation Performance**: The IBQ-based autoregressive models perform well at every scale tested. In particular, the 2.1B-parameter model reaches a gFID of 2.05 and an IS of 286.73, which are reported as better than existing diffusion models and other autoregressive model variants.

### Conclusion

By proposing IBQ, the paper addresses the codebook collapse that existing vector quantization methods suffer in large-scale visual generation and achieves large-scale visual tokenizer training with high utilization. The experimental results show that IBQ performs well on both reconstruction and generation and has good scalability and robustness.
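
The claim that every code receives a gradient can be checked by hand on a toy example. The sketch below is not the authors' implementation: it assumes dot-product logits between the feature and each code, uses a hypothetical function name (`ibq_forward_backward`), a made-up upstream gradient `grad_out`, and a three-entry codebook, and it writes out the backward pass manually so that no deep learning framework is required.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ibq_forward_backward(z, codebook, grad_out):
    """Pass one feature vector through an index-backpropagation quantizer.

    z        : encoded feature, length D
    codebook : K code embeddings, each of length D
    grad_out : upstream gradient dL/dz_q, length D (hypothetical loss)
    Returns (selected index, z_q, dL/dcodebook, dL/dz).
    """
    # Assumption: logits are dot products between the feature and every code.
    logits = [dot(z, c) for c in codebook]
    soft = softmax(logits)
    k_star = max(range(len(codebook)), key=lambda k: soft[k])

    # Forward: Ind = hard + soft - stop_grad(soft) evaluates to the one-hot
    # vector, so z_q is exactly the selected code.
    z_q = list(codebook[k_star])

    # Backward: the gradient w.r.t. Ind is passed straight through to `soft`,
    # then through the softmax Jacobian to the logits.
    d_ind = [dot(grad_out, c) for c in codebook]
    mean_d = dot(soft, d_ind)
    d_logits = [p * (d - mean_d) for p, d in zip(soft, d_ind)]

    grad_codebook = []
    for k in range(len(codebook)):
        hard_k = 1.0 if k == k_star else 0.0
        # Direct path (selected code only) plus logit path (every code).
        grad_codebook.append(
            [hard_k * g + d_logits[k] * z_i for g, z_i in zip(grad_out, z)]
        )
    grad_z = [
        sum(d_logits[k] * codebook[k][i] for k in range(len(codebook)))
        for i in range(len(z))
    ]
    return k_star, z_q, grad_codebook, grad_z


# Toy values chosen only for illustration.
z = [0.5, -1.0]
codebook = [[1.0, 0.0], [0.0, -1.0], [-1.0, 1.0]]
k_star, z_q, grad_codebook, grad_z = ibq_forward_backward(z, codebook, [1.0, 2.0])

codes_with_gradient = sum(
    1 for row in grad_codebook if any(abs(g) > 1e-9 for g in row)
)
print(f"selected code: {k_star}, codes receiving gradient: {codes_with_gradient}")
```

In a quantizer with partial updates, only the row for the selected code would be non-zero. Here the logit path contributes a term to every row, which is the mechanism the summary credits for keeping non-activated codes aligned with the encoder.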

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Resizing codebook of vector quantization without retraining

Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization

XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

Factorized Visual Tokenization and Generation

Image Understanding Makes for A Good Tokenizer for Image Generation

SoftVQ-VAE: Efficient 1-Dimensional Continuous Tokenizer

Image and Video Tokenization with Binary Spherical Quantization

Scaling Image Tokenizers with Grouped Spherical Quantization

Regularized Vector Quantization for Tokenized Image Synthesis

Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling

Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Deep Recurrent Quantization for Generating Sequential Binary Codes

ADFQ-ViT: Activation-Distribution-Friendly Post-Training Quantization for Vision Transformers

LibVQ: A Toolkit for Optimizing Vector Quantization and Efficient Neural Retrieval

Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation

Vector Quantization with Self-Attention for Quality-Independent Representation Learning

SweetTokenizer: Semantic-Aware Spatial-Temporal Tokenizer for Compact Visual Discretization

Autoregressive Image Generation without Vector Quantization