Abstract: We introduce AiM, an autoregressive (AR) image generative model based on the Mamba architecture. AiM employs Mamba, a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time complexity, to supplant the commonly utilized Transformers in AR image generation models, aiming to achieve both superior generation quality and enhanced inference speed. Unlike existing methods that adapt Mamba to handle two-dimensional signals via multi-directional scan, AiM directly utilizes the next-token prediction paradigm for autoregressive image generation. This approach circumvents the need for extensive modifications to enable Mamba to learn 2D spatial representations. By implementing straightforward yet strategically targeted modifications for visual generative tasks, we preserve Mamba's core structure, fully exploiting its efficient long-sequence modeling capabilities and scalability. We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B. On the ImageNet1K 256×256 benchmark, our best AiM model achieves an FID of 2.21, surpassing all existing AR models of comparable parameter counts and demonstrating significant competitiveness against diffusion models, with 2 to 10 times faster inference speed. Code is available at https://github.com/hp-l33/AiM
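
The next-token prediction paradigm mentioned in the abstract can be illustrated with a minimal sketch. It assumes the image has already been converted into a 2D grid of discrete token ids, that the grid is flattened in raster (row-by-row) order, and that a class token is prepended as the condition. The abstract specifies none of these details, so the function names, the token ids, and the `CLASS_TOKEN` value below are all hypothetical.

```python
from typing import List, Tuple


def raster_flatten(grid: List[List[int]]) -> List[int]:
    """Flatten a 2D grid of token ids row by row (raster order)."""
    return [tok for row in grid for tok in row]


def next_token_pairs(seq: List[int], class_token: int) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs for next-token prediction.

    Each target token is predicted from the class token plus all
    tokens that precede it in the flattened sequence.
    """
    full = [class_token] + seq
    return [(full[:i], full[i]) for i in range(1, len(full))]


# Hypothetical 2x3 grid of token ids (illustrative values only).
grid = [[5, 9, 2],
        [7, 1, 4]]
CLASS_TOKEN = 1000  # hypothetical id standing in for the class condition

seq = raster_flatten(grid)
pairs = next_token_pairs(seq, CLASS_TOKEN)
for context, target in pairs:
    print(context, "->", target)
```

Because the model only ever sees a single left-to-right sequence, no multi-directional scan is needed in this setup; the 2D structure is carried implicitly by token order.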

What problem does this paper attempt to address?

The paper aims to improve both the generation quality and the inference speed of autoregressive (AR) image generation models. It proposes AiM (Autoregressive image generation with Mamba), an AR image generation model built on the Mamba architecture. Mamba is a state-space model (SSM) with linear time complexity, which makes it well suited to long-sequence modeling.

### Main Problems and Solutions

1. **Generation Quality and Inference Speed**:
   - Existing AR image generation models usually use the Transformer architecture. Generation quality is high, but the computational complexity is quadratic in sequence length, which slows inference.
   - By replacing the Transformer with Mamba, AiM targets both better generation quality and faster inference. According to the abstract, AiM is competitive with diffusion models while running inference 2 to 10 times faster.
2. **Adapting to Two-Dimensional Signals**:
   - Mamba models are mainly used for one-dimensional sequence data such as text. To apply them to two-dimensional image data, existing methods usually adopt a multi-directional scanning strategy, which increases the number of parameters and the computational cost.
   - AiM directly uses the next-token prediction paradigm for autoregressive image generation. This avoids complex modifications and preserves Mamba's core structure and its efficient long-sequence modeling ability.
3. **Balance between Parameter Efficiency and Performance**:
   - To balance parameter count against performance, AiM introduces a new adaptive layer normalization method, adaLN-Group. The method divides the layers into several groups. Each group shares local parameters while every layer keeps its own bias, which reduces memory usage without significantly sacrificing performance.
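
The parameter saving from adaLN-Group can be made concrete with a rough count. This is a sketch under stated assumptions, not the paper's implementation: it assumes each adaLN module is a single linear projection from a conditioning vector of width `dim` to `k * dim` modulation values, that layers are assigned to groups in contiguous blocks, and that grouping shares the projection weights while keeping one bias vector per layer. The function names and the default `k=3` are hypothetical.

```python
def adaln_params(num_layers: int, dim: int, k: int = 3) -> int:
    """Per-layer adaLN: every layer owns a full projection (weights + bias)."""
    return num_layers * (dim * k * dim + k * dim)


def adaln_group_params(num_layers: int, num_groups: int, dim: int, k: int = 3) -> int:
    """Grouped adaLN (assumed form): one shared weight matrix per group,
    plus a separate bias vector for every layer."""
    shared_weights = num_groups * (dim * k * dim)
    per_layer_biases = num_layers * (k * dim)
    return shared_weights + per_layer_biases


def group_of(layer: int, num_layers: int, num_groups: int) -> int:
    """Map a layer index to its group, using contiguous blocks of layers."""
    return layer * num_groups // num_layers


# Illustrative sizes only; these are not the paper's configurations.
layers, groups, width = 24, 4, 768
print("per-layer adaLN params:", adaln_params(layers, width))
print("adaLN-Group params:    ", adaln_group_params(layers, groups, width))
print("layer -> group:", [group_of(i, layers, groups) for i in range(layers)])
```

Under these assumptions, setting `num_groups` equal to `num_layers` recovers the per-layer count, and a single group gives the smallest count, so the group count acts as the dial between parameter efficiency and per-layer flexibility.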
### Experimental Results

- On the ImageNet1K 256×256 benchmark, the best AiM model achieves an FID (Fréchet Inception Distance) of 2.21, surpassing all existing AR models of comparable parameter counts while also offering a clear advantage in inference speed.
- The smallest AiM model achieves an FID of 3.5 with only 148M parameters, outperforming other models that need more than twice as many parameters to reach similar results.

### Summary

The main contributions of the paper are as follows:

1. It proposes AiM, the first autoregressive image generation model based on the Mamba architecture, achieving high-quality and efficient class-conditional image generation.
2. It adapts the model to visual generation tasks by introducing position encoding and the adaLN-Group method.
3. On the ImageNet 256×256 benchmark, AiM demonstrates strong generation quality and fast inference, supporting its efficiency and scalability.

Together, these changes give AiM strong generation quality and fast inference among autoregressive image generation models.
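
For reference, the FID scores quoted above follow the standard definition of Fréchet Inception Distance, which the summary does not spell out. Assuming Inception features of real and generated images are each modeled as a Gaussian with means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ (symbols introduced here for illustration), the metric is:

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

Lower values indicate that the generated distribution is closer to the real one, so an FID of 2.21 is better than an FID of 3.5.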

Scalable Autoregressive Image Generation with Mamba

MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation

Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation

DiMSUM: Diffusion Mamba -- A Scalable and Unified Spatial-Frequency Method for Image Generation

Autoregressive Pretraining with Mamba in Vision

DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

MobileMamba: Lightweight Multi-Receptive Visual Mamba Network

LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba

MambaMIR: An Arbitrary-Masked Mamba for Joint Medical Image Reconstruction and Uncertainty Estimation

M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

MambaOut: Do We Really Need Mamba for Vision?

Mamba-R: Vision Mamba ALSO Needs Registers

MambaMIM: Pre-training Mamba with State Space Token-interpolation

SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

Hi-Mamba: Hierarchical Mamba for Efficient Image Super-Resolution

$\text{S}^{3}$Mamba: Arbitrary-Scale Super-Resolution via Scaleable State Space Model

A Survey on Visual Mamba

MambaIR: A Simple Baseline for Image Restoration with State-Space Model