Microscopic-Mamba: Revealing the Secrets of Microscopic Images with Just 4M Parameters

Shun Zou, Zhuo Zhang, Yi Zou, Guangwei Gao
2024-09-12
Abstract: In the field of medical microscopic image classification (MIC), CNN-based and Transformer-based models have been extensively studied. However, CNNs struggle with modeling long-range dependencies, limiting their ability to fully utilize semantic information in images. Conversely, Transformers are hampered by the complexity of quadratic computations. To address these challenges, we propose a model based on the Mamba architecture: Microscopic-Mamba. Specifically, we designed the Partially Selected Feed-Forward Network (PSFFN) to replace the last linear layer of the Visual State Space Module (VSSM), enhancing Mamba's local feature extraction capabilities. Additionally, we introduced the Modulation Interaction Feature Aggregation (MIFA) module to effectively modulate and dynamically aggregate global and local features. We also incorporated a parallel VSSM mechanism to improve inter-channel information interaction while reducing the number of parameters. Extensive experiments have demonstrated that our method achieves state-of-the-art performance on five public datasets. Code is available at https://github.com/zs1314/Microscopic-Mamba
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limitations of existing models in medical microscopic image classification (MIC). Specifically:

1. **Limitations of Convolutional Neural Networks (CNNs)**:
   - CNNs have difficulty modeling long-range dependencies, which restricts how fully they can use the semantic information in an image.
   - Their local receptive fields make it hard to capture long-distance information.
2. **Limitations of Transformer Models**:
   - Although Transformers are good at global modeling, the time complexity of their self-attention mechanism is quadratic in sequence length, which creates a heavy computational burden, especially for long sequences.
   - This cost matters in real medical settings, which usually operate under strict computational resource limits.
3. **Limitations of Methods Combining CNN and Transformer**:
   - Some studies combine CNNs and Transformers to reduce computational complexity, but this often sacrifices the Transformer's ability to capture global information.

To address these challenges, the authors propose a new model based on the Mamba architecture: Microscopic-Mamba. The model aims to capture both global and local features effectively while maintaining linear complexity. Specific improvements include:

- **Partially Selected Feed-Forward Network (PSFFN)**: Replaces the last linear layer in the Visual State Space Module (VSSM) to enhance local feature extraction.
- **Modulation Interaction Feature Aggregation (MIFA) Module**: Modulates and dynamically aggregates global and local features.
- **Parallel VSSM Mechanism**: Improves information interaction between channels while reducing the number of parameters.

With these improvements, the experimental results on five public datasets show that Microscopic-Mamba not only outperforms existing state-of-the-art methods but also has fewer parameters and lower computational complexity.
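The contrast between quadratic self-attention and a linear-complexity state space scan can be made concrete with a toy example. The sketch below is not the authors' VSSM: it is a minimal scalar linear recurrence with fixed, hypothetical coefficients `a`, `b`, and `c`, whereas Mamba-style models make such parameters depend on the input. The function names and the token counts are assumptions chosen only for illustration.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key score entries one self-attention head computes."""
    return num_tokens * num_tokens


def ssm_scan(inputs, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear state space recurrence over a token sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    Returns the outputs and the number of recurrence steps (one per token).
    The coefficients are fixed here; this is a simplification for illustration.
    """
    state = 0.0
    outputs = []
    steps = 0
    for x in inputs:
        state = a * state + b * x   # update the hidden state with the new token
        outputs.append(c * state)   # read out the current state
        steps += 1
    return outputs, steps


if __name__ == "__main__":
    # Hypothetical patch grids: doubling the side length quadruples the tokens.
    for side in (14, 28, 56):
        n = side * side
        _, steps = ssm_scan([1.0] * n)
        print(f"tokens={n} attention_scores={attention_pair_count(n)} scan_steps={steps}")
```

Each token's output still depends on every earlier token through the carried state, which is how a scan of this kind keeps long-range context while its step count grows only linearly with sequence length.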