Abstract: Medical image segmentation is crucial for accurate diagnosis. Although convolutional neural network (CNN)-based methods have advanced considerably in recent years, they struggle to model long-range dependencies. Transformer-based methods address this limitation but require more computational resources. The segment anything model (SAM) can generate pixel-level segmentation results for natural images from sparse manual prompts, but it performs poorly on low-contrast, noisy ultrasound images. To address these issues, we propose a new medical image segmentation network architecture that integrates transformer components, CNN modules, and a SAM encoder into a unified framework, allowing it to capture both long-range dependencies and local features simultaneously. In addition, we incorporate the image features extracted by SAM as prior knowledge to further improve segmentation accuracy when training data are limited. To reduce the computational burden, we employ an axial attention mechanism, which approximates the effect of a transformer by expanding the receptive field. Rather than replacing the transformer components with lightweight attention modules, we divide our model into a global branch and a local branch: the global branch extracts context features with the transformer components, while the local branch processes patch tokens with the axial attention mechanism. We also construct an image pyramid to mine internal statistics and multiscale representations, yielding more accurate segmentation regions. Experimental results on various medical image datasets show that this bibranch pyramid transformer (Bi-BPT) architecture is effective and robust for medical image segmentation and outperforms other related segmentation network architectures.
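As a rough illustration of why axial attention is cheaper than full self-attention, the following pure-Python sketch factorizes 2D self-attention into a row pass followed by a column pass. It is a minimal sketch under simplifying assumptions: a single head, identity query/key/value projections, and no positional encoding. The function names (`attend_1d`, `axial_attention`, `score_counts`) are hypothetical and are not taken from the Bi-BPT implementation.

```python
import math
from typing import List, Tuple

Vector = List[float]
Grid = List[List[Vector]]  # H x W grid of C-dimensional feature vectors


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_1d(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention over a 1D sequence.

    Simplifying assumption: queries, keys and values are the tokens themselves
    (identity projections); a trained layer would learn these projections.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in tokens])
        out.append([sum(wt * v[c] for wt, v in zip(weights, tokens)) for c in range(dim)])
    return out


def axial_attention(grid: Grid) -> Grid:
    """Factorized 2D attention: a row pass followed by a column pass."""
    h, w = len(grid), len(grid[0])
    # Row pass: each position attends only to the W positions in its row.
    rows = [attend_1d(grid[i]) for i in range(h)]
    # Column pass: each position attends only to the H positions in its column.
    cols = [attend_1d([rows[i][j] for i in range(h)]) for j in range(w)]
    # cols is indexed [col][row]; transpose back to [row][col].
    return [[cols[j][i] for j in range(w)] for i in range(h)]


def score_counts(h: int, w: int) -> Tuple[int, int]:
    """Number of query-key scores: full 2D self-attention vs. axial attention."""
    full = (h * w) ** 2       # every position scores every other position
    axial = h * w * (h + w)   # W scores in the row pass + H in the column pass
    return full, axial


if __name__ == "__main__":
    print(score_counts(64, 64))  # (16777216, 524288)
```

After the two passes, every output position depends on every input position, so the receptive field spans the whole grid, while the number of query-key scores drops from (HW)^2 to HW(H+W). For a 64 x 64 feature map, that is 16,777,216 versus 524,288 scores, a 32-fold reduction.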
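The image-pyramid idea can be sketched in the same hedged way. The code below builds a pyramid by repeated 2 x 2 average pooling and fuses the levels by nearest-neighbour upsampling followed by pixel-wise averaging. Both the pooling operator and the averaging fusion are assumptions chosen for simplicity; the abstract does not specify how Bi-BPT downsamples or fuses scales, and the helper names are hypothetical.

```python
from typing import List

Image = List[List[float]]  # single-channel image, indexed [row][col]


def downsample_2x(img: Image) -> Image:
    """Average-pool non-overlapping 2x2 blocks (assumes even height and width)."""
    h, w = len(img), len(img[0])
    return [
        [(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def build_pyramid(img: Image, levels: int) -> List[Image]:
    """Return [full-res, 1/2, 1/4, ...], stopping early if a level cannot be halved."""
    pyramid = [img]
    while len(pyramid) < levels:
        prev = pyramid[-1]
        h, w = len(prev), len(prev[0])
        if h < 2 or w < 2 or h % 2 or w % 2:
            break
        pyramid.append(downsample_2x(prev))
    return pyramid


def upsample_nearest(img: Image, out_h: int, out_w: int) -> Image:
    """Nearest-neighbour upsampling to out_h x out_w."""
    h, w = len(img), len(img[0])
    return [[img[i * h // out_h][j * w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def fuse_pyramid(pyramid: List[Image]) -> Image:
    """Upsample every level to full resolution and average them pixel-wise."""
    out_h, out_w = len(pyramid[0]), len(pyramid[0][0])
    ups = [upsample_nearest(level, out_h, out_w) for level in pyramid]
    n = len(ups)
    return [[sum(u[i][j] for u in ups) / n for j in range(out_w)]
            for i in range(out_h)]


if __name__ == "__main__":
    image = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    pyr = build_pyramid(image, levels=5)
    print([len(level) for level in pyr])  # [4, 2, 1]: stops once 1x1 is reached
    print(pyr[1])  # [[2.5, 4.5], [10.5, 12.5]]
```

Each fused pixel mixes fine detail from the full-resolution level with coarser context from the lower levels, which is the kind of multiscale representation the pyramid is meant to expose; a learned network would typically replace the plain average with trainable fusion.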

Integrating prior knowledge into a bibranch pyramid network for medical image segmentation