Dimba: Transformer-Mamba Diffusion Models

Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, Junshi Huang

2024-06-03

Abstract: This paper unveils Dimba, a new text-to-image diffusion model built on a hybrid architecture that combines Transformer and Mamba elements. Specifically, Dimba's sequentially stacked blocks alternate between Transformer and Mamba layers and integrate conditional information through cross-attention layers, thus capitalizing on the advantages of both architectural paradigms. We investigate several optimization strategies, including quality tuning and resolution adaptation, and identify critical configurations necessary for large-scale image generation. The model's flexible design supports scenarios with specific resource constraints and objectives. When scaled appropriately, Dimba offers substantial throughput and a reduced memory footprint relative to conventional pure Transformer-based benchmarks. Extensive experiments indicate that Dimba achieves performance comparable to these benchmarks in terms of image quality, artistic rendering, and semantic control. We also report several intriguing properties of the architecture discovered during evaluation and release the checkpoints used in our experiments. Our findings emphasize the promise of large-scale hybrid Transformer-Mamba architectures in the foundational stage of diffusion models, suggesting a bright future for text-to-image generation.
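The alternating layout described in the abstract can be illustrated with a toy forward pass. This is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the function names (`attend`, `ssm_scan`, `hybrid_forward`, `layer_kinds`) are hypothetical, attention uses identity projections instead of learned weights, the Mamba layer is replaced by a fixed per-channel linear recurrence (real Mamba uses input-dependent, selective parameters), and the choices to start with attention and to apply cross-attention after every layer are assumptions. Normalization and feed-forward sublayers are omitted.

```python
import math
from typing import List

Seq = List[List[float]]  # a sequence of token vectors


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def attend(queries: Seq, context: Seq) -> Seq:
    """Scaled dot-product attention with identity projections (no learned weights)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, context))
                    for j in range(len(q))])
    return out


def ssm_scan(x: Seq, decay: float = 0.9) -> Seq:
    """Simplified stand-in for a Mamba layer (assumption):
    h_t = decay * h_{t-1} + (1 - decay) * x_t, applied per channel."""
    h = [0.0] * len(x[0])
    out = []
    for tok in x:
        h = [decay * hj + (1.0 - decay) * xj for hj, xj in zip(h, tok)]
        out.append(list(h))
    return out


def add(a: Seq, b: Seq) -> Seq:
    """Element-wise sum, used for residual connections."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def layer_kinds(num_layers: int) -> List[str]:
    """Alternate token mixers; starting with attention is an assumption."""
    return ["attention" if i % 2 == 0 else "mamba" for i in range(num_layers)]


def hybrid_forward(image_tokens: Seq, text_tokens: Seq, num_layers: int = 4) -> Seq:
    x = image_tokens
    for kind in layer_kinds(num_layers):
        mixed = attend(x, x) if kind == "attention" else ssm_scan(x)
        x = add(x, mixed)                   # residual around the token mixer
        x = add(x, attend(x, text_tokens))  # residual cross-attention to the text
    return x
```

Calling `hybrid_forward(image_tokens, text_tokens)` with image and text tokens of the same width returns a sequence with the same shape as `image_tokens`. The point of the sketch is only the control flow: the self-attention branch compares every token with every other token, while the recurrent branch carries a fixed-size state along the sequence.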

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The main problem this paper attempts to address is the heavy memory and compute cost of existing text-to-image generation models when handling long contexts. Although CNN-based and Transformer-based diffusion models have made significant progress in text-to-image generation, their applicability to long sequences is limited: attention cost and memory grow quadratically with sequence length. To overcome this, the paper proposes Dimba, a hybrid architecture that combines Transformer layers with Mamba layers (a state space model), aiming to raise throughput and reduce memory usage while preserving generation quality. Dimba stacks alternating Transformer and Mamba layers and integrates conditional information through cross-attention layers, leveraging the advantages of both architectures. The paper also explores several optimization strategies, including quality tuning and resolution adaptation, to identify the key configurations required for large-scale image generation. Experimental results show that Dimba is comparable to existing benchmark models in image quality, artistic rendering, and semantic control, and that its flexible design can be adapted to specific resource constraints and objectives.
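A rough back-of-the-envelope calculation shows why sequence length matters here. The sketch below is illustrative only: the 8x latent downsampling, the patch size of 2, and the state-space dimensions (`d_inner=1024`, `d_state=16`) are assumed values typical of latent diffusion setups, not figures taken from the paper, and all function names are hypothetical.

```python
def num_tokens(image_size: int, vae_downsample: int = 8, patch_size: int = 2) -> int:
    """Tokens for a square image after latent encoding and patchification (assumed setup)."""
    side = image_size // (vae_downsample * patch_size)
    return side * side


def attention_score_entries(n_tokens: int) -> int:
    """Entries in one head's N x N self-attention score matrix."""
    return n_tokens * n_tokens


def ssm_state_entries(d_inner: int, d_state: int) -> int:
    """Size of a recurrent state-space state; it does not depend on sequence length."""
    return d_inner * d_state


for size in (256, 512, 1024):
    n = num_tokens(size)
    print(size, n, attention_score_entries(n), ssm_state_entries(1024, 16))
# 256 256 65536 16384
# 512 1024 1048576 16384
# 1024 4096 16777216 16384
```

Under these assumptions, doubling the image side quadruples the token count and multiplies the attention score matrix by 16, while the recurrent state stays the same size. Fused attention kernels can avoid materializing the full score matrix, but the compute still scales quadratically with the number of tokens, and per-token activations in either layer type still grow linearly.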
