Abstract: The accurate segmentation of medical images is critical for various healthcare applications. Convolutional neural networks (CNNs), especially Fully Convolutional Networks (FCNs) like U-Net, have shown remarkable success in medical image segmentation tasks. However, they have limitations in capturing global context and long-range relations, especially for objects with significant variations in shape, scale, and texture. While transformers have achieved state-of-the-art results in natural language processing and image recognition, they face challenges in medical image segmentation due to image locality and translational invariance issues. To address these challenges, this paper proposes an innovative U-shaped network called BEFUnet, which enhances the fusion of body and edge information for precise medical image segmentation. BEFUnet comprises three main modules: a novel Local Cross-Attention Feature (LCAF) fusion module, a novel Double-Level Fusion (DLF) module, and a dual-branch encoder. The dual-branch encoder consists of an edge encoder and a body encoder. The edge encoder employs PDC blocks for effective edge information extraction, while the body encoder uses the Swin Transformer to capture semantic information with global attention. The LCAF module efficiently fuses edge and body features by selectively performing local cross-attention on features that are spatially close between the two modalities. This local approach significantly reduces computational complexity compared to global cross-attention while ensuring accurate feature matching. BEFUnet demonstrates superior performance over existing methods across various evaluation metrics on medical image segmentation datasets.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses precise medical image segmentation by proposing a hybrid CNN-Transformer architecture named **BEFUnet**.

#### Background and Motivation

1. **Limitations of Existing Methods**:
   - **Convolutional Neural Networks (CNNs)**: Despite their excellent performance in medical image segmentation tasks (e.g., U-Net), their local convolution operations limit the ability to capture global context and long-range relationships, especially for objects with significant variations in shape, scale, and texture.
   - **Transformer Models**: Although they have achieved state-of-the-art results in natural language processing and image recognition tasks, they face challenges in medical image segmentation due to image locality and translation invariance issues.
2. **Combining the Strengths of CNN and Transformer**:
   - **CNN**: Excels at capturing local information and edge features.
   - **Transformer**: Excels at capturing global context and long-range dependencies.

#### Proposed Method

BEFUnet is an innovative U-shaped network that includes three main modules:

1. **Dual-Branch Encoder**: Composed of an edge encoder and a body encoder.
   - **Edge Encoder**: Uses PDC blocks to extract edge information.
   - **Body Encoder**: Uses the Swin Transformer to capture semantic information with global attention.
2. **Local Cross-Attention Feature (LCAF) Fusion Module**: Fuses edge and body features by selectively performing local cross-attention on spatially proximate features between the two modalities. Restricting attention to nearby features significantly reduces computational complexity compared to global cross-attention while ensuring accurate feature matching.
3. **Double-Level Fusion (DLF) Module**: Fuses coarse-grained and fine-grained feature representations.

#### Main Contributions

1. **Innovative Hybrid Approach**: Combines the edge-local semantic information of CNNs with the body-contextual interaction of Transformers, enhancing the fusion of complementary features. This is particularly suitable for handling irregular and challenging boundaries.
2. **Double-Level Fusion Module**: Fuses coarse-grained and fine-grained feature representations.
3. **Experimental Results**: Extensive training and evaluation on three different medical image segmentation datasets show that BEFUnet outperforms various state-of-the-art models across multiple evaluation metrics, supporting its robustness.
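
To make the idea of local cross-attention more concrete, here is a minimal NumPy sketch. It is not the paper's LCAF implementation: it assumes non-overlapping square windows, uses body features as queries and edge features as keys and values, and omits learned projections and multiple heads. The function names (`window_partition`, `window_merge`, `local_cross_attention`, `attention_pairs`) are hypothetical and chosen only for this illustration.

```python
import numpy as np


def window_partition(x, w):
    """Split an (H, W, C) map into (num_windows, w*w, C) non-overlapping windows."""
    H, W, C = x.shape
    assert H % w == 0 and W % w == 0, "H and W must be divisible by the window size"
    x = x.reshape(H // w, w, W // w, w, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows of windows, cols of windows, w, w, C)
    return x.reshape(-1, w * w, C)


def window_merge(windows, H, W, w):
    """Inverse of window_partition: (num_windows, w*w, C) -> (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // w, W // w, w, w, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def local_cross_attention(body, edge, w):
    """Body tokens attend to edge tokens inside the same w-by-w window.

    Assumption for illustration: identity query/key/value projections.
    """
    H, W, C = body.shape
    q = window_partition(body, w)                    # queries from the body branch
    k = window_partition(edge, w)                    # keys from the edge branch
    v = k                                            # values from the edge branch
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(C)   # (num_windows, w*w, w*w)
    attn = softmax(scores, axis=-1)
    return window_merge(attn @ v, H, W, w)


def attention_pairs(H, W, w=None):
    """Count query-key score entries: global if w is None, else windowed."""
    n = H * W
    return n * n if w is None else n * w * w


rng = np.random.default_rng(0)
body = rng.standard_normal((8, 8, 16))
edge = rng.standard_normal((8, 8, 16))
fused = local_cross_attention(body, edge, w=4)
print(fused.shape)                                        # (8, 8, 16)
print(attention_pairs(8, 8), attention_pairs(8, 8, w=4))  # 4096 1024
```

Under these assumptions, an 8×8 feature map needs 4096 query-key scores for global cross-attention but only 1024 with 4×4 windows. More generally, for N tokens the score count drops from N² to N·w², which is the kind of saving the summary attributes to the local approach.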

BEFUnet: A Hybrid CNN-Transformer Architecture for Precise Medical Image Segmentation
