Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, Ying Shan
2024-09-07
Abstract: We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters. The Open-MAGVIT2 project produces an open-source replication of Google's MAGVIT-v2 tokenizer, a tokenizer with a super-large codebook (i.e., $2^{18}$ codes), and achieves state-of-the-art reconstruction performance (1.17 rFID) on ImageNet $256 \times 256$. Furthermore, we explore its application in plain auto-regressive models and validate scalability properties. To assist auto-regressive models in predicting with a super-large vocabulary, we factorize it into two sub-vocabularies of different sizes by asymmetric token factorization, and further introduce "next sub-token prediction" to enhance sub-token interaction for better generation quality. We release all models and code to foster innovation and creativity in the field of auto-regressive visual generation.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The main goal of this paper is to advance autoregressive image generation. Specifically, it addresses two issues:

1. **Reproduction of the Visual Tokenizer**: The paper re-implements the Lookup-Free Quantizer proposed in MAGVIT-v2 as an open-source tokenizer and reports 1.17 rFID for reconstruction on ImageNet at 256×256 resolution. The tokenizer uses an ultra-large codebook ($2^{18}$ codes), and its release makes a tokenizer of this scale available to the academic community.
2. **Autoregressive Generation with an Ultra-Large Codebook**: To help autoregressive models predict over such a large vocabulary, the authors split it into two sub-vocabularies of different sizes (asymmetric token factorization) and propose "next sub-token prediction" to strengthen the interaction between sub-tokens, thereby improving generation quality.

With these improvements, the paper argues that a powerful tokenizer allows plain autoregressive models to perform strongly and to scale on the standard ImageNet benchmark. According to the reported experiments, Open-MAGVIT2 outperforms previous methods on multiple metrics, most notably for image generation at 256×256 resolution.
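
To make the two ideas above concrete, here is a minimal Python sketch of how a lookup-free quantizer can turn an 18-dimensional latent into a token index, and how that index can be split into two sub-tokens of unequal size. It is an illustration under stated assumptions, not the project's implementation. The 6-bit / 12-bit split, the bit ordering, and the names `lfq_index`, `factorize`, and `merge` are all hypothetical; the text above only states that the vocabulary has $2^{18}$ codes and is factorized asymmetrically.

```python
# Illustrative sketch only. The bit split and all function names are assumptions.
TOTAL_BITS = 18                      # codebook of 2**18 codes, as stated in the abstract
LOW_BITS = 6                         # hypothetical smaller sub-vocabulary: 2**6 entries
HIGH_BITS = TOTAL_BITS - LOW_BITS    # remaining bits: 2**12 entries


def lfq_index(latent):
    """Map one latent vector to a token index without a codebook lookup.

    Each dimension contributes one bit: 1 if the value is positive, else 0.
    """
    if len(latent) != TOTAL_BITS:
        raise ValueError(f"expected {TOTAL_BITS} dimensions, got {len(latent)}")
    index = 0
    for position, value in enumerate(latent):
        if value > 0:
            index |= 1 << position
    return index


def factorize(index):
    """Split a token index into (low, high) sub-tokens of unequal size."""
    if not 0 <= index < (1 << TOTAL_BITS):
        raise ValueError("index outside the 2**18 vocabulary")
    low = index & ((1 << LOW_BITS) - 1)   # lowest LOW_BITS bits
    high = index >> LOW_BITS              # remaining HIGH_BITS bits
    return low, high


def merge(low, high):
    """Inverse of factorize: rebuild the full token index from its sub-tokens."""
    return (high << LOW_BITS) | low


if __name__ == "__main__":
    latent = [0.3, -1.2] * 9              # toy 18-dimensional latent
    token = lfq_index(latent)
    low, high = factorize(token)
    assert merge(low, high) == token
    print(token, low, high)
```

Under this assumed split, a model would predict two sub-tokens from vocabularies of 64 and 4,096 entries instead of one token from 262,144 entries. The "next sub-token prediction" step described above, where the second sub-token is predicted conditioned on the first, is not modeled here.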