Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Jindong Wang, Zhe Lin, Bhiksha Raj
Abstract: Image tokenizers play a critical role in shaping the performance of subsequent generative models. Since the introduction of VQ-GAN, discrete image tokenization has undergone remarkable advancements. Improvements in architecture, quantization techniques, and training recipes have significantly enhanced both image reconstruction and the downstream generation quality. In this paper, we present XQ-GAN, an image tokenization framework designed for both image reconstruction and generation tasks. Our framework integrates state-of-the-art quantization techniques, including vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), lookup-free quantization (LFQ), and binary spherical quantization (BSQ), within a highly flexible and customizable training environment. On the standard ImageNet 256x256 benchmark, our released model achieves an rFID of 0.64, significantly surpassing MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID). Furthermore, we demonstrate that using XQ-GAN as a tokenizer improves gFID metrics alongside rFID. For instance, with the same VAR architecture, XQ-GAN+VAR achieves a gFID of 2.6, outperforming VAR's 3.3 gFID by a notable margin. To support further research, we provide pre-trained weights of different image tokenizers for the community to directly train the subsequent generative models on it or fine-tune for specialized tasks.
What problem does this paper attempt to address?
The paper addresses how to improve image reconstruction and generation quality through a better image tokenizer. Specifically, the authors propose XQ-GAN, an open-source image tokenization framework that optimizes the discrete representation of images by integrating multiple advanced quantization techniques, namely Vector Quantization (VQ), Residual Quantization (RQ), Multi-Scale Residual Quantization (MSRQ), Product Quantization (PQ), Lookup-Free Quantization (LFQ), and Binary Spherical Quantization (BSQ), within a highly flexible and customizable training environment, thereby improving the performance of downstream generative models.
### Main Contribution Points:
1. **Comprehensive Quantization Technology Integration**: The XQ-GAN framework integrates multiple state-of-the-art quantization techniques and lets users combine them according to their requirements, with the goal of better image reconstruction and generation.
2. **Flexibility and Customizability**: The framework offers a choice of encoders, decoders, discriminator architectures, and semantic alignment methods, and its modular design lets researchers mix these components to explore different design options.
3. **Pretrained Weights**: To support further research, the authors provide model weights pretrained on multiple datasets, including ImageNet, LAION-400M, and IMed-361M, which the community can use directly to train downstream generative models or fine-tune for specific tasks.
4. **Performance Advantage**: On the standard ImageNet 256×256 benchmark, XQ-GAN reaches an rFID (a reconstruction-quality metric) of 0.64, significantly better than MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID). Using XQ-GAN as the tokenizer also improves gFID (a generation-quality metric). For example, with the same VAR architecture, XQ-GAN+VAR reaches a gFID of 2.6, compared with 3.3 for VAR.
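The summary does not define rFID and gFID. Both are conventionally instances of the Fréchet Inception Distance (FID), which compares the Gaussian statistics of Inception features for two image sets; lower is better. The symbols below are introduced here for illustration and do not come from the paper: $\mu_a, \Sigma_a$ and $\mu_b, \Sigma_b$ are the feature mean and covariance of the two sets.

$$
\mathrm{FID}(a, b) = \lVert \mu_a - \mu_b \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_a + \Sigma_b - 2\,(\Sigma_a \Sigma_b)^{1/2} \right)
$$

Under the usual convention, rFID takes $a$ to be the original images and $b$ their tokenizer reconstructions, while gFID takes $a$ to be real images and $b$ samples from the generative model.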
### Key Technologies of the Solution:
- **Quantization Technologies**: VQ, RQ, MSRQ, PQ, LFQ, and BSQ, each of which discretizes the image representation in a different way.
- **Semantic Alignment**: A pretrained DINOv2 or CLIP model is used to inject rich semantic information into the tokenized image representation, improving semantic consistency and the quality of the downstream generative model.
- **Adversarial Discriminator**: The framework provides several discriminator architectures (such as PatchGAN, StyleGAN, and DINO) and loss function choices, so that reconstructed images look realistic in both local detail and overall structure.
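To make the differences between the quantizers listed above more concrete, here is a minimal NumPy sketch of VQ, RQ, PQ, LFQ, and BSQ applied to a batch of latent vectors. It is an illustration under simplifying assumptions, not XQ-GAN's implementation: the function names (`vq`, `rq`, `pq`, `lfq`, `bsq`) are hypothetical, the codebooks are fixed arrays rather than learned parameters, training details such as the straight-through gradient and commitment losses are left out, and MSRQ is omitted because it additionally resizes the residual across scales.

```python
import numpy as np

def vq(z, codebook):
    """Vector quantization: map each row of z to its nearest codebook entry."""
    # Squared Euclidean distances, shape (num_vectors, codebook_size).
    dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)
    return codebook[idx], idx

def rq(z, codebooks):
    """Residual quantization: quantize, subtract, then quantize what is left."""
    residual = z.astype(float)
    z_hat = np.zeros_like(residual)
    indices = []
    for cb in codebooks:
        q, idx = vq(residual, cb)
        z_hat = z_hat + q          # accumulate the coarse-to-fine approximation
        residual = residual - q    # the next stage only sees the remaining error
        indices.append(idx)
    return z_hat, indices

def pq(z, codebooks):
    """Product quantization: split channels into groups, one codebook per group."""
    chunks = np.split(z, len(codebooks), axis=1)
    parts = [vq(chunk, cb) for chunk, cb in zip(chunks, codebooks)]
    z_hat = np.concatenate([q for q, _ in parts], axis=1)
    return z_hat, [idx for _, idx in parts]

def lfq(z):
    """Lookup-free quantization: binarize each channel to -1 or +1."""
    q = np.where(z > 0, 1.0, -1.0)
    # The token index is the integer whose bits are the channel signs.
    weights = 2 ** np.arange(z.shape[1], dtype=np.int64)
    idx = ((q > 0).astype(np.int64) * weights).sum(axis=1)
    return q, idx

def bsq(z):
    """Binary spherical quantization: project to the unit sphere, then binarize."""
    u = z / np.linalg.norm(z, axis=1, keepdims=True)
    # Scaling by 1/sqrt(d) keeps every quantized vector on the unit sphere.
    return np.where(u > 0, 1.0, -1.0) / np.sqrt(z.shape[1])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.normal(size=(4, 8))                      # 4 latent vectors, 8 channels
    stages = [rng.normal(size=(16, 8)) for _ in range(3)]
    z_hat, codes = rq(z, stages)
    print("RQ reconstruction shape:", z_hat.shape, "stages:", len(codes))
    print("LFQ token indices:", lfq(z)[1])
```

The main contrast is where the discrete code comes from: VQ, RQ, and PQ look up nearest neighbours in explicit codebooks (once, in stages, or per channel group), whereas LFQ and BSQ read the code directly off the signs of the latent channels.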
### Summary:
By integrating multiple advanced quantization techniques within a flexible training environment, the XQ-GAN framework significantly improves performance on both image reconstruction and generation. Beyond its technical contributions, it gives the research community practical tools and pretrained weights to support further work on image generation.