BEFUnet: A Hybrid CNN-Transformer Architecture for Precise Medical Image Segmentation

Omid Nejati Manzari, Javad Mirzapour Kaleybar, Hooman Saadat, Shahin Maleki
2024-02-14
Abstract: Accurate segmentation of medical images is critical for many healthcare applications. Convolutional neural networks (CNNs), especially Fully Convolutional Networks (FCNs) such as U-Net, have shown remarkable success in medical image segmentation. However, they are limited in capturing global context and long-range relations, especially for objects with significant variations in shape, scale, and texture. While transformers have achieved state-of-the-art results in natural language processing and image recognition, they face challenges in medical image segmentation due to issues with image locality and translational invariance. To address these challenges, this paper proposes a U-shaped network called BEFUnet, which enhances the fusion of body and edge information for precise medical image segmentation. BEFUnet comprises three main modules: a novel Local Cross-Attention Feature (LCAF) fusion module, a novel Double-Level Fusion (DLF) module, and a dual-branch encoder. The dual-branch encoder consists of an edge encoder and a body encoder. The edge encoder employs PDC blocks for effective edge information extraction, while the body encoder uses the Swin Transformer to capture semantic information with global attention. The LCAF module fuses edge and body features by selectively performing local cross-attention on features that are spatially close between the two modalities. This local approach significantly reduces computational complexity compared to global cross-attention while ensuring accurate feature matching. BEFUnet demonstrates superior performance over existing methods across various evaluation metrics on medical image segmentation datasets.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses precise medical image segmentation by proposing a hybrid CNN-Transformer architecture named **BEFUnet**.

#### Background and Motivation

1. **Limitations of Existing Methods**:
   - **Convolutional Neural Networks (CNNs)**: Despite their excellent performance in medical image segmentation tasks (e.g., U-Net), their local convolution operations limit the ability to capture global context and long-range relationships, especially for objects with significant variations in shape, scale, and texture.
   - **Transformer Models**: Although they have achieved state-of-the-art results in natural language processing and image recognition, they face challenges in medical image segmentation due to issues with image locality and translation invariance.
2. **Combining the Strengths of CNN and Transformer**:
   - **CNN**: Excels at capturing local information and edge features.
   - **Transformer**: Excels at capturing global context and long-range dependencies.

#### Proposed Method

BEFUnet is a U-shaped network with three main modules:

1. **Dual-Branch Encoder**: Composed of an edge encoder and a body encoder.
   - **Edge Encoder**: Uses PDC blocks to extract edge information.
   - **Body Encoder**: Uses the Swin Transformer to capture semantic information with global attention.
2. **Local Cross-Attention Feature (LCAF) Fusion Module**: Fuses edge and body features by performing local cross-attention only on spatially proximate features between the two modalities, which significantly reduces computational complexity compared to global cross-attention while ensuring accurate feature matching.
3. **Double-Level Fusion (DLF) Module**: Fuses coarse-grained and fine-grained feature representations.

#### Main Contributions

1. **Innovative Hybrid Approach**: Combines the edge-local semantic information of CNNs with the body-contextual interaction of Transformers, enhancing the fusion of complementary features. This is particularly suited to irregular and challenging boundaries.
2. **Double-Level Fusion Module**: Fuses coarse-grained and fine-grained feature representations.
3. **Experimental Results**: Extensive training and evaluation on three different medical image segmentation datasets show that BEFUnet outperforms various state-of-the-art models across multiple evaluation metrics, supporting its robustness.
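
To make the idea behind the LCAF module more concrete, the following is a minimal sketch of local cross-attention between a body feature map and an edge feature map. It is not the authors' implementation: the function names `local_cross_attention` and `count_local_pairs`, the `radius` parameter, the square window shape, and the use of identity query/key/value projections (no learned weights) are all assumptions made for illustration. Each body position attends only to edge features inside a small window around the same location, and the helper counts how many query-key pairs that requires compared with global cross-attention.

```python
import math


def local_cross_attention(body, edge, radius=1):
    """Fuse `edge` into `body` with windowed cross-attention (illustrative only).

    body, edge: nested lists shaped [H][W][C] with identical H, W, C.
    Each body position is a query; keys and values are the edge features
    inside a (2*radius+1) x (2*radius+1) window, clipped at the borders.
    Assumption: no learned projections, so keys double as values.
    """
    height, width = len(body), len(body[0])
    channels = len(body[0][0])
    scale = 1.0 / math.sqrt(channels)
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            query = body[y][x]
            # Gather edge features that are spatially close to (y, x).
            neighbours = [
                edge[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # Scaled dot-product scores between the query and each local key.
            scores = [
                scale * sum(q * k for q, k in zip(query, key))
                for key in neighbours
            ]
            # Numerically stable softmax restricted to the window.
            peak = max(scores)
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            weights = [w / total for w in weights]
            # Attention output: weighted sum of the local edge features.
            row.append([
                sum(w * value[c] for w, value in zip(weights, neighbours))
                for c in range(channels)
            ])
        fused.append(row)
    return fused


def count_local_pairs(height, width, radius):
    """Number of query-key pairs evaluated by the windowed attention above."""
    return sum(
        (min(height, y + radius + 1) - max(0, y - radius))
        * (min(width, x + radius + 1) - max(0, x - radius))
        for y in range(height)
        for x in range(width)
    )


if __name__ == "__main__":
    # Toy 4x4 feature maps with 2 channels (values are arbitrary).
    body = [[[0.1 * (y + x), 1.0] for x in range(4)] for y in range(4)]
    edge = [[[float(y == x), 0.5] for x in range(4)] for y in range(4)]
    fused = local_cross_attention(body, edge, radius=1)
    print(len(fused), len(fused[0]), len(fused[0][0]))  # 4 4 2
    # Local pairs versus global cross-attention pairs on a 4x4 map.
    print(count_local_pairs(4, 4, 1), (4 * 4) ** 2)  # 100 256
```

On this toy 4x4 map the windowed variant evaluates 100 query-key pairs instead of the 256 that global cross-attention would need. The gap widens with resolution, because the local cost grows linearly in the number of positions for a fixed window while the global cost grows quadratically.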