Scalable Autoregressive Image Generation with Mamba

Haopeng Li, Jinyue Yang, Kexin Wang, Xuerui Qiu, Yuhong Chou, Xin Li, Guoqi Li
2024-08-22
Abstract: We introduce AiM, an autoregressive (AR) image generative model based on the Mamba architecture. AiM employs Mamba, a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time complexity, to supplant the commonly utilized Transformers in AR image generation models, aiming to achieve both superior generation quality and enhanced inference speed. Unlike existing methods that adapt Mamba to handle two-dimensional signals via multi-directional scan, AiM directly utilizes the next-token prediction paradigm for autoregressive image generation. This approach circumvents the need for extensive modifications to enable Mamba to learn 2D spatial representations. By implementing straightforward yet strategically targeted modifications for visual generative tasks, we preserve Mamba's core structure, fully exploiting its efficient long-sequence modeling capabilities and scalability. We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B. On the ImageNet1K 256×256 benchmark, our best AiM model achieves an FID of 2.21, surpassing all existing AR models of comparable parameter counts and demonstrating significant competitiveness against diffusion models, with 2 to 10 times faster inference speed. Code is available at https://github.com/hp-l33/AiM
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to improve both the generation quality and the inference speed of autoregressive (AR) image generation models. To that end, it proposes AiM (Autoregressive image generation with Mamba), a new AR image generation model built on the Mamba architecture. Mamba is a state-space model (SSM) with linear time complexity, which makes it well suited to long-sequence modeling tasks.

### Main Problems and Solutions

1. **Generation Quality and Inference Speed**:
   - Existing autoregressive image generation models usually use the Transformer architecture. Their generation quality is high, but the computational complexity is quadratic in sequence length, which slows inference.
   - By replacing the Transformer with Mamba, AiM aims to improve generation quality while also speeding up inference. The abstract reports 2 to 10 times faster inference speed in its comparison with diffusion models.
2. **Adapting to Two-Dimensional Signals**:
   - Mamba models are mainly used for one-dimensional sequence data, such as text. To apply them to two-dimensional image data, existing methods usually adopt a multi-directional scanning strategy, which increases the number of parameters and the computational cost.
   - AiM directly uses the next-token prediction paradigm for autoregressive image generation. This avoids complex modifications and preserves Mamba's core structure and its efficient long-sequence modeling ability.
3. **Balance between Parameter Efficiency and Performance**:
   - To balance parameter count against performance, AiM introduces a new adaptive layer normalization method, adaLN-Group. It divides the layers into several groups. The layers in a group share parameters while each layer retains its own bias, which optimizes memory usage without significantly sacrificing performance.

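The grouping idea behind adaLN-Group can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `AdaLNGroup`, the shapes, and the choice to produce only a scale and a shift (with no gating term) are all hypothetical. It shows one modulation projection shared by every layer in a group, plus a small bias that stays specific to each layer.

```python
import numpy as np


class AdaLNGroup:
    """Toy adaLN-Group: a shared projection per group, a bias per layer.

    All names and shapes are illustrative assumptions, not the paper's code.
    """

    def __init__(self, num_layers, num_groups, cond_dim, hidden_dim, seed=0):
        assert num_layers % num_groups == 0, "layers must split evenly into groups"
        rng = np.random.default_rng(seed)
        self.layers_per_group = num_layers // num_groups
        # Shared within a group: maps the condition vector to (scale, shift).
        self.group_weight = rng.normal(0.0, 0.02, (num_groups, cond_dim, 2 * hidden_dim))
        # Specific to each layer: a bias added on top of the shared projection.
        self.layer_bias = np.zeros((num_layers, 2 * hidden_dim))

    def modulation(self, layer_idx, cond):
        """Return (scale, shift) for one layer given a condition vector."""
        group = layer_idx // self.layers_per_group
        out = cond @ self.group_weight[group] + self.layer_bias[layer_idx]
        scale, shift = np.split(out, 2, axis=-1)
        return scale, shift

    def apply(self, layer_idx, x, cond):
        """Normalize x over its last axis, then modulate with scale and shift."""
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        x_norm = (x - mean) / np.sqrt(var + 1e-6)
        scale, shift = self.modulation(layer_idx, cond)
        return x_norm * (1.0 + scale) + shift

    def num_modulation_params(self):
        return self.group_weight.size + self.layer_bias.size


if __name__ == "__main__":
    mod = AdaLNGroup(num_layers=12, num_groups=4, cond_dim=8, hidden_dim=16)
    per_layer = 12 * 8 * 32 + 12 * 32  # one projection per layer, for comparison
    print("grouped:", mod.num_modulation_params(), "per-layer:", per_layer)
```

With 12 layers in 4 groups, the shared projections are stored 4 times instead of 12, while the per-layer biases are unchanged. That is the memory trade-off the summary describes.
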
### Experimental Results

- In the ImageNet1K 256×256 benchmark test, the best model of AiM achieved an FID (Fréchet Inception Distance) of 2.21, surpassing all Transformer-based AR models with comparable numbers of parameters, and showing significant advantages in inference speed.
- The smallest-scale AiM model achieved an FID of 3.5 with only 148M parameters, outperforming other models that require more than twice the number of parameters to achieve similar results.

### Summary

The main contributions of the paper are as follows:

1. Proposed the first autoregressive image generation model AiM based on the Mamba architecture, achieving high-quality and efficient class-conditional image generation.
2. Optimized the model's performance in visual generation tasks by introducing position encoding and the adaLN-Group method.
3. In the ImageNet 256×256 benchmark test, AiM demonstrated superior performance and fast inference speed, proving its efficiency and scalability.

Through these improvements, AiM not only reaches a new level in generation quality but also surpasses existing models in inference speed, bringing important progress to the field of autoregressive image generation.
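
To make the next-token prediction paradigm from this summary concrete, here is a toy sketch of generation over a row-by-row (raster) token order with a fixed-size recurrent state. It rests on several assumptions. `ToyLinearSSM` is a plain diagonal linear recurrence with fixed random weights, not Mamba's input-dependent selective scan. The model is untrained, decoding is greedy, and `start_token` stands in for a class-condition token. The point is only that each generation step updates a constant-size state, so cost grows linearly with the number of image tokens.

```python
import numpy as np


def raster_flatten(token_grid):
    """Flatten an (H, W) grid of token ids row by row into a 1D sequence."""
    return np.asarray(token_grid).reshape(-1)


class ToyLinearSSM:
    """Recurrence h_t = a * h_{t-1} + embed(x_t) @ B, logits_t = h_t @ C.

    A hypothetical stand-in for a state-space layer, with random weights.
    """

    def __init__(self, vocab_size, embed_dim=8, state_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        self.state_dim = state_dim
        self.embed = rng.normal(0.0, 1.0, (vocab_size, embed_dim))
        self.a = rng.uniform(0.5, 0.99, state_dim)  # per-channel decay
        self.B = rng.normal(0.0, 0.1, (embed_dim, state_dim))
        self.C = rng.normal(0.0, 0.1, (state_dim, vocab_size))

    def step(self, h, token):
        """One step: constant work, independent of how many tokens came before."""
        h = self.a * h + self.embed[token] @ self.B
        logits = h @ self.C
        return h, logits


def generate(model, start_token, num_tokens):
    """Greedy next-token generation; returns the num_tokens generated ids."""
    h = np.zeros(model.state_dim)
    tokens = [start_token]
    for _ in range(num_tokens):
        h, logits = model.step(h, tokens[-1])
        tokens.append(int(np.argmax(logits)))
    return tokens[1:]


if __name__ == "__main__":
    model = ToyLinearSSM(vocab_size=32)
    tokens = generate(model, start_token=0, num_tokens=16)
    grid = np.array(tokens).reshape(4, 4)  # back to a 4x4 grid in raster order
    print(grid)
```

A Transformer-based generator would instead attend over all previous tokens at every step, which is where the quadratic cost mentioned earlier comes from.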