Dimba: Transformer-Mamba Diffusion Models

Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Youqiang Zhang, Junshi Huang
2024-06-03
Abstract: This paper unveils Dimba, a new text-to-image diffusion model that employs a distinctive hybrid architecture combining Transformer and Mamba elements. Specifically, Dimba's sequentially stacked blocks alternate between Transformer and Mamba layers and integrate conditional information through a cross-attention layer, thus capitalizing on the advantages of both architectural paradigms. We investigate several optimization strategies, including quality tuning and resolution adaptation, and identify critical configurations necessary for large-scale image generation. The model's flexible design supports scenarios that cater to specific resource constraints and objectives. When scaled appropriately, Dimba offers substantial throughput and a reduced memory footprint relative to conventional pure Transformer-based benchmarks. Extensive experiments indicate that Dimba achieves performance comparable to benchmarks in terms of image quality, artistic rendering, and semantic control. We also report several intriguing properties of the architecture discovered during evaluation and release the checkpoints used in the experiments. Our findings emphasize the promise of large-scale hybrid Transformer-Mamba architectures in the foundational stage of diffusion models, suggesting a bright future for text-to-image generation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper attempts to address is the high memory and compute cost of existing text-to-image generation models when they handle long contexts. Although CNN-based and Transformer-based diffusion models have made significant progress in text-to-image generation, their usefulness on long sequences is limited: self-attention in particular has memory and compute costs that grow quadratically with sequence length. To overcome this, the paper proposes Dimba, a hybrid architecture that combines Transformer layers with Mamba layers (Mamba is a state space model), aiming to raise throughput and reduce memory usage while keeping generation quality competitive. Dimba stacks Transformer and Mamba layers in alternation and integrates conditional information through cross-attention layers, drawing on the strengths of both architectures. The paper also explores several optimization strategies, including quality tuning and resolution adaptation, to identify the key configurations required for large-scale image generation. Experimental results show that Dimba is comparable to existing benchmark models in image quality, artistic rendering, and semantic control, and that its flexible design can be adapted to scenarios with specific resource constraints and objectives.
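To make the alternating layout and its cost argument concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It only builds a layer schedule and compares a rough token-mixing cost for a pure-attention stack against an alternating one. The names `BlockSpec`, `build_schedule`, `mixer_cost`, and `total_cost` are hypothetical, as are the `attention_every` ratio, the choice to attach cross-attention to every block, and the cost model itself (quadratic in sequence length for self-attention, linear for a Mamba-style scan, ignoring constants and hidden size).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BlockSpec:
    """Describes one block of a hypothetical hybrid stack."""
    index: int
    mixer: str             # "attention" or "mamba"
    cross_attention: bool  # assumption: text conditioning enters via cross-attention


def build_schedule(num_blocks: int, attention_every: int = 2) -> list[BlockSpec]:
    """Return a schedule where every `attention_every`-th block uses
    self-attention and the remaining blocks use a Mamba-style mixer.

    attention_every=1 gives a pure-attention stack; 2 gives strict alternation.
    """
    if num_blocks <= 0 or attention_every <= 0:
        raise ValueError("num_blocks and attention_every must be positive")
    return [
        BlockSpec(
            index=i,
            mixer="attention" if i % attention_every == 0 else "mamba",
            cross_attention=True,
        )
        for i in range(num_blocks)
    ]


def mixer_cost(mixer: str, seq_len: int) -> int:
    """Rough token-mixing cost: seq_len**2 for attention, seq_len for a scan."""
    if mixer == "attention":
        return seq_len * seq_len
    if mixer == "mamba":
        return seq_len
    raise ValueError(f"unknown mixer: {mixer}")


def total_cost(schedule: list[BlockSpec], seq_len: int) -> int:
    """Sum the rough token-mixing cost over all blocks in the schedule."""
    return sum(mixer_cost(block.mixer, seq_len) for block in schedule)


if __name__ == "__main__":
    seq_len = 1024  # e.g. a 32x32 grid of latent patches (illustrative only)
    pure = build_schedule(num_blocks=4, attention_every=1)
    hybrid = build_schedule(num_blocks=4, attention_every=2)
    print([b.mixer for b in hybrid])
    print("pure attention cost:", total_cost(pure, seq_len))
    print("hybrid cost:        ", total_cost(hybrid, seq_len))
```

Under this toy cost model, strict alternation roughly halves the quadratic term, and the gap widens as the sequence grows. Actual throughput and memory depend on kernel implementations, hidden size, and hardware, none of which the sketch models.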