CAMS: Convolution and Attention-Free Mamba-based Cardiac Image Segmentation

Abbas Khan, Muhammad Asad, Martin Benning, Caroline Roney, Gregory Slabaugh
2024-10-29
Abstract: Convolutional Neural Networks (CNNs) and Transformer-based self-attention models have become the standard for medical image segmentation. This paper demonstrates that convolution and self-attention, while widely used, are not the only effective methods for segmentation. Breaking with convention, we present a Convolution and self-Attention-free Mamba-based semantic Segmentation Network named CAMS-Net. Specifically, we design a Mamba-based Channel Aggregator and Spatial Aggregator, which are applied independently in each encoder-decoder stage. The Channel Aggregator extracts information across different channels, and the Spatial Aggregator learns features across different spatial locations. We also propose a Linearly Interconnected Factorized Mamba (LIFM) block to reduce the computational complexity of a Mamba block and to enhance its decision function by introducing a non-linearity between two factorized Mamba blocks. Our model outperforms the existing state-of-the-art CNN, self-attention, and Mamba-based methods on the CMR and M&Ms-2 cardiac segmentation datasets, showing how this convolution- and self-attention-free method can inspire further research beyond the CNN and Transformer paradigms while achieving linear complexity and reducing the number of parameters. Source code and pre-trained models are available at: <a class="link-external link-https" href="https://github.com/kabbas570/CAMS-Net" rel="external noopener nofollow">this https URL</a>.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper starts from the observation that, in cardiac image segmentation, convolutional neural networks (CNNs) and Transformer models based on self-attention are widely used but are not the only effective approaches. Both have known limitations:

1. **Limitations of CNNs**: CNNs extract local features well, but their limited receptive field makes it hard to capture long-range dependencies, and they tend to recognize textures rather than shapes.
2. **Limitations of the self-attention mechanism**: Self-attention captures global information and long-range dependencies, but its computational complexity is quadratic in the sequence length, which leads to high computational cost and large memory requirements.

To overcome these limitations, the paper proposes CAMS-Net (Convolution and Attention-Free Mamba-based Cardiac Image Segmentation Network), a cardiac image segmentation network that uses neither convolution nor self-attention. By building on Mamba blocks and their variants, the network achieves linear computational complexity while still modeling a global receptive field and long-range dependencies.

### Main contributions

1. **CAMS-Net**: The first Mamba-based cardiac image segmentation network that uses no convolution and no self-attention at all.
2. **Linearly Interconnected Factorized Mamba (LIFM) block**: Factorizing the Mamba block and inserting a non-linearity between the two factorized blocks reduces the number of parameters and strengthens the model's non-linear capacity.
3. **Mamba Channel Aggregator (MCA) and Mamba Spatial Aggregator (MSA)**: These extract information along the channel and spatial dimensions, respectively.
4. **Extensive experimental verification**: Experiments on the CMR and M&Ms-2 datasets show that CAMS-Net outperforms existing CNN, self-attention, and hybrid-architecture methods.

### Method overview

- **Input processing**: The input image is split into non-overlapping 2x2 patches, which a linear embedding layer projects into a 64-dimensional feature space. Position encoding is added to preserve spatial context.
- **Encoder-decoder structure**: In each encoder stage, features are down-sampled by a 2x2 average pooling layer. In the bottleneck layer and the decoder stages, the CS-IF module fuses channel and spatial information.
- **Decoder**: In each decoder stage, features are up-sampled by bilinear interpolation and then processed by the CS-IF module and the MCA module.
- **Final output**: A Softmax activation produces a five-class segmentation map (left atrium, right atrium, left ventricle, right ventricle, and background).

### Experimental results

- **CMR dataset**: CAMS-Net outperforms existing methods on multiple metrics, especially Dice Score and Hausdorff Distance.
- **M&Ms-2 dataset**: CAMS-Net also performs well in multi-center, multi-view, multi-disease clinical scenarios, especially on the RV segmentation task.

### Conclusion

With CAMS-Net, the paper shows the potential of convolution-free and self-attention-free methods for cardiac image segmentation. The network outperforms existing methods while also offering clear advantages in computational efficiency and parameter count, which points to a new direction for future research in medical image segmentation.
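
The input-processing, down-sampling, and output steps listed in the method overview can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names (`patchify_2x2`, `linear_embed`, `avg_pool_2x2`, `softmax`) are hypothetical, the image is assumed to be single-channel, the embedding weights are supplied by the caller rather than learned, and position encoding, the Mamba-based modules, and bilinear up-sampling are omitted.

```python
import math


def patchify_2x2(image):
    """Split an H x W single-channel image (list of rows) into
    non-overlapping 2x2 patches, each flattened to a 4-element vector."""
    h, w = len(image), len(image[0])
    assert h % 2 == 0 and w % 2 == 0, "height and width must be even"
    return [
        [
            [image[i][j], image[i][j + 1], image[i + 1][j], image[i + 1][j + 1]]
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]


def linear_embed(patches, weight):
    """Project every flattened patch with a (embed_dim x patch_len) matrix.
    The summary states embed_dim = 64; any size works in this sketch."""
    embedded = []
    for patch_row in patches:
        out_row = []
        for patch in patch_row:
            out_row.append(
                [sum(w * x for w, x in zip(w_row, patch)) for w_row in weight]
            )
        embedded.append(out_row)
    return embedded


def avg_pool_2x2(fmap):
    """Down-sample a single-channel H x W map by averaging each 2x2 block."""
    h, w = len(fmap), len(fmap[0])
    assert h % 2 == 0 and w % 2 == 0, "height and width must be even"
    return [
        [
            (fmap[i][j] + fmap[i][j + 1] + fmap[i + 1][j] + fmap[i + 1][j + 1]) / 4
            for j in range(0, w, 2)
        ]
        for i in range(0, h, 2)
    ]


def softmax(logits):
    """Turn one pixel's class logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    patches = patchify_2x2(image)          # 2 x 2 grid of 4-element patches
    tokens = linear_embed(patches, [[1, 0, 0, 0], [0.25] * 4])  # embed_dim = 2
    pooled = avg_pool_2x2(image)           # 2 x 2 map
    probs = softmax([0.0] * 5)             # five classes, as in the summary
    print(len(patches), len(tokens[0][0]), pooled, probs)
```

In this sketch each 2x2 patch halves the spatial resolution once at the input, and each pooling step halves it again, which is why the decoder needs an up-sampling step to return to the input resolution.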