MambaMIM: Pre-training Mamba with State Space Token-interpolation
Fenghe Tang, Bingkun Nian, Yingtai Li, Jie Yang, Liu Wei, S. Kevin Zhou
2024-08-15
Abstract: Generative self-supervised learning demonstrates outstanding representation learning capabilities in both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). However, there are currently no generative pre-training methods related to selective state space models (Mamba) that can handle long-range dependencies effectively. To address this challenge, we introduce a generative self-supervised learning method for Mamba (MambaMIM) based on Selective Structure State Space Sequence Token-interpolation (S6T), a general-purpose pre-training method for arbitrary Mamba architectures. Our method, MambaMIM, incorporates a bottom-up 3D hybrid masking strategy in the encoder to maintain masking consistency across different architectures. Additionally, S6T is employed to learn causal relationships between the masked sequence in the state space. MambaMIM can be used on any single or hybrid Mamba architectures to enhance the Mamba long-range representation capability. Extensive downstream experiments reveal the feasibility and advancement of using Mamba for pre-training medical image tasks. The code is available at: https://github.com/FengheTan9/MambaMIM
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper
The paper primarily addresses the following issues:
1. **Long-range Dependency Modeling**: Generative self-supervised pre-training is well established for CNNs and ViTs, but no comparable method exists for selective state space models (Mamba). The paper proposes MambaMIM, a generative self-supervised learning method based on Selective Structure State Space Sequence Token-interpolation (S6T). S6T learns causal relationships between the masked sequence in the state space and is intended to strengthen long-range representation in any single or hybrid Mamba architecture.
2. **Medical Image Pre-training**: MambaMIM is pre-trained on a large-scale 3D CT dataset and its performance is validated on downstream medical image segmentation tasks. Experimental results show that MambaMIM significantly outperforms other advanced self-supervised pre-training methods in various medical image segmentation tasks.
3. **Consistency of Hybrid Architectures**: To keep masking consistent between CNN and Mamba layers, the paper proposes a bottom-up 3D hybrid masking strategy in the encoder. Keeping the masked regions aligned across layers during end-to-end training is meant to improve the effectiveness of representation learning.
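
The bottom-up masking idea in item 3 can be illustrated with a minimal NumPy sketch. It assumes, as an interpretation not confirmed by this summary, that the mask is sampled once on the coarsest (deepest) 3D feature grid and then upsampled by nearest-neighbour repetition to the higher-resolution stages, so that every stage hides the same spatial regions. The function name `bottom_up_masks`, the grid size, the scale factors, and the 0.75 mask ratio are hypothetical choices for illustration, not values taken from the paper.

```python
import numpy as np


def bottom_up_masks(coarse_shape, scales, mask_ratio=0.75, seed=0):
    """Sample one boolean mask on the coarsest 3D grid, then upsample it.

    Hypothetical helper: True marks a masked cell. Each entry of `scales`
    is the integer upsampling factor for one encoder stage.
    """
    rng = np.random.default_rng(seed)
    n_tokens = int(np.prod(coarse_shape))
    n_masked = int(round(mask_ratio * n_tokens))

    # Choose which coarse cells to mask, without replacement.
    flat = np.zeros(n_tokens, dtype=bool)
    flat[rng.choice(n_tokens, size=n_masked, replace=False)] = True
    coarse = flat.reshape(coarse_shape)

    masks = []
    for s in scales:
        # Nearest-neighbour upsampling: each coarse cell becomes an s*s*s block.
        m = coarse
        for axis in range(3):
            m = np.repeat(m, s, axis=axis)
        masks.append(m)
    return masks


# Coarsest grid of 4x4x4 cells; finer stages are 2x, 4x, and 8x larger per axis.
masks = bottom_up_masks((4, 4, 4), scales=(1, 2, 4, 8))
for m in masks:
    print(m.shape, float(m.mean()))
```

Because every finer mask is a block-wise copy of the coarse one, subsampling any stage with its stride recovers the coarse mask, and the masked fraction is identical at all scales.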
In summary, the paper aims to improve Mamba models on medical image segmentation through MambaMIM pre-training and reports validation results across multiple downstream datasets.