MambaMIM: Pre-training Mamba with State Space Token-interpolation

Fenghe Tang,Bingkun Nian,Yingtai Li,Jie Yang,Liu Wei,S. Kevin Zhou

2024-08-15

Abstract: Generative self-supervised learning demonstrates outstanding representation learning capabilities in both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). However, there are currently no generative pre-training methods related to selective state space models (Mamba) that can handle long-range dependencies effectively. To address this challenge, we introduce a generative self-supervised learning method for Mamba (MambaMIM) based on Selective Structure State Space Sequence Token-interpolation (S6T), a general-purpose pre-training method for arbitrary Mamba architectures. Our method, MambaMIM, incorporates a bottom-up 3D hybrid masking strategy in the encoder to maintain masking consistency across different architectures. Additionally, S6T is employed to learn causal relationships between the masked sequence in the state space. MambaMIM can be used on any single or hybrid Mamba architectures to enhance the Mamba long-range representation capability. Extensive downstream experiments reveal the feasibility and advancement of using Mamba for pre-training medical image tasks. The code is available at: https://github.com/FengheTan9/MambaMIM

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily addresses the following issues:

1. **Long-range Dependency Modeling**: Selective state space models such as Mamba are suited to long-range dependencies, but no generative self-supervised pre-training method existed for them. The paper proposes MambaMIM, which is based on Selective Structure State Space Sequence Token-interpolation (S6T). S6T learns causal relationships between the masked sequence in the state space and applies to any single or hybrid Mamba architecture.
2. **Medical Image Pre-training**: MambaMIM is pre-trained on a large-scale 3D CT dataset and its performance is validated on downstream medical image segmentation tasks. The reported experimental results show that MambaMIM significantly outperforms other advanced self-supervised pre-training methods across these tasks.
3. **Consistency of Hybrid Architectures**: To keep the mask consistent between CNN and Mamba layers, the paper proposes a bottom-up 3D hybrid masking strategy in the encoder. Every layer then sees the same masked regions during end-to-end training, which improves the effectiveness of representation learning.

In summary, the paper aims to improve Mamba models on medical image segmentation tasks through MambaMIM pre-training and validates the method on multiple datasets.
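The masking-consistency idea in point 3 can be illustrated with a small sketch. This is not the authors' implementation: it assumes that "bottom-up" means the mask is drawn once on the coarsest token grid and then copied to finer feature maps by nearest-neighbour upsampling. The function names, grid sizes, and mask ratio are all hypothetical and chosen only for illustration.

```python
import random

def make_coarse_mask(shape, mask_ratio, seed=0):
    """Draw a random 3D mask on the coarsest token grid (True = masked).

    Assumption: the mask is sampled once at the deepest stage.
    """
    d, h, w = shape
    total = d * h * w
    n_masked = round(total * mask_ratio)
    rng = random.Random(seed)
    masked = set(rng.sample(range(total), n_masked))
    return [[[((z * h + y) * w + x) in masked for x in range(w)]
             for y in range(h)]
            for z in range(d)]

def upsample_mask(mask, factor):
    """Nearest-neighbour upsampling: each coarse cell becomes a factor^3 block."""
    d, h, w = len(mask), len(mask[0]), len(mask[0][0])
    return [[[mask[z // factor][y // factor][x // factor]
              for x in range(w * factor)]
             for y in range(h * factor)]
            for z in range(d * factor)]

def masked_fraction(mask):
    """Fraction of cells that are masked."""
    cells = [v for plane in mask for row in plane for v in row]
    return sum(cells) / len(cells)

# Hypothetical 2x2x2 token grid at the deepest stage, half of it masked.
coarse = make_coarse_mask((2, 2, 2), mask_ratio=0.5)

# Masks for finer (shallower) stages, keyed by upsampling factor.
masks = {1: coarse}
for factor in (2, 4):
    masks[factor] = upsample_mask(coarse, factor)
```

Because every finer mask is a block-wise copy of the coarse one, the masked fraction is identical at each scale and a voxel hidden at one stage is hidden at all of them, which is the consistency property the summary describes.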

Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion

Autoregressive Pretraining with Mamba in Vision

Mamba-ND: Selective State Space Modeling for Multi-Dimensional Data

MambaMIR: An Arbitrary-Masked Mamba for Joint Medical Image Reconstruction and Uncertainty Estimation

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

MaskMamba: A Hybrid Mamba-Transformer Model for Masked Image Generation

SiMBA: Simplified Mamba-Based Architecture for Vision and Multivariate Time series

QuadMamba: Learning Quadtree-based Selective Scan for Visual State Space Model

Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation

U-Mamba: Enhancing Long-range Dependency for Biomedical Image Segmentation

A Survey on Visual Mamba

I2I-Mamba: Multi-modal medical image synthesis via selective state space modeling

LPAM: A lightweight medical segmentation network based on Mamba improved by prompt attention

SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation

MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining

MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba

MedMamba: Vision Mamba for Medical Image Classification