Abstract: Numerous studies have demonstrated the strong performance of Vision Transformer (ViT)-based methods across various computer vision tasks. However, ViT models often struggle to effectively capture high-frequency components in images, which are crucial for detecting small targets and preserving edge details, especially in complex scenarios. This limitation is particularly challenging in colon polyp segmentation, where polyps exhibit significant variability in structure, texture, and shape. High-frequency information, such as boundary details, is essential for achieving precise semantic segmentation in this context. To address these challenges, we propose HiFiSeg, a novel network for colon polyp segmentation that enhances high-frequency information processing through a global-local vision transformer framework. HiFiSeg leverages the pyramid vision transformer (PVT) as its encoder and introduces two key modules: the global-local interaction module (GLIM) and the selective aggregation module (SAM). GLIM employs a parallel structure to fuse global and local information at multiple scales, effectively capturing fine-grained features. SAM selectively integrates boundary details from low-level features with semantic information from high-level features, significantly improving the model's ability to accurately detect and segment polyps. Extensive experiments on five widely recognized benchmark datasets demonstrate the effectiveness of HiFiSeg for polyp segmentation. Notably, the mDice scores on the challenging CVC-ColonDB and ETIS datasets reached 0.826 and 0.822, respectively, underscoring the superior performance of HiFiSeg in handling the specific complexities of this task.
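
To make "high-frequency components" concrete, the following minimal sketch splits a toy image into a blurred (low-frequency) part and the residual (high-frequency) part. It is only an illustration of the concept, not the HiFiSeg method: the 3x3 box blur, the function names `box_blur` and `high_frequency`, and the toy image are all assumptions introduced here.

```python
def box_blur(img):
    """3x3 mean filter with edge replication; a simple low-pass filter (assumed choice)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # Clamp coordinates so border pixels reuse their nearest neighbours.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total / 9.0
    return out


def high_frequency(img):
    """Residual after removing the low-pass part: large only near edges and small details."""
    low = box_blur(img)
    return [[img[y][x] - low[y][x] for x in range(len(img[0]))]
            for y in range(len(img))]


# Toy 6x6 image: a bright 2x2 "polyp" on a dark background (hypothetical data).
image = [[0.0] * 6 for _ in range(6)]
for y in (2, 3):
    for x in (2, 3):
        image[y][x] = 1.0

residual = high_frequency(image)
```

In the residual, flat background pixels stay at zero while pixels on and around the small bright region carry non-zero values, which is why boundary and small-object information is described as high-frequency.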

What problem does this paper attempt to address?

This paper addresses the difficulty that existing methods have in capturing high-frequency components in images for colon polyp segmentation, which matters most for small-object detection and boundary-detail preservation. Although the Vision Transformer (ViT) performs well across computer-vision tasks, its limited handling of high-frequency information leads to inaccurate detection of small objects and boundary details in complex scenarios such as colon polyp segmentation. To address this, the paper proposes the HiFiSeg network, which enhances high-frequency information processing through a global-local vision transformer framework to improve the accuracy of colon polyp segmentation.

### Main contributions:

1. **Propose the HiFiSeg network**: Use the Pyramid Vision Transformer (PVT) as the encoder to capture more powerful features than CNN-based methods.
2. **Design two key modules**: The Global-Local Interaction Module (GLIM) and the Selective Aggregation Module (SAM). GLIM fuses global and local information through multi-scale convolution kernels and parallel structures to extract fine-grained features; SAM selectively fuses low-level boundary details and high-level semantic information to reduce boundary blurring.
3. **Experimental verification**: Evaluate HiFiSeg on five standard benchmark datasets: Kvasir, CVC-ClinicDB, CVC-300, CVC-ColonDB and ETIS. On the challenging CVC-ColonDB and ETIS datasets, HiFiSeg achieves mDice scores of 0.826 and 0.822 respectively, surpassing the existing state-of-the-art methods.

### Background and motivation:

- **Importance of polyp detection**: Colorectal cancer usually originates from colon polyps, especially adenomatous polyps, so early detection and resection are crucial for preventing cancer progression.
- **Limitations of existing methods**: Manual annotation is time-consuming and error-prone, so automated and accurate image-segmentation methods are needed to assist diagnosis. Although deep-learning algorithms (especially CNNs) have achieved remarkable success in medical-image applications, their limited receptive fields make it difficult to capture long-range dependencies and global context.
- **Advantages and challenges of Transformer**: Transformers capture complex spatial transformations and long-range dependencies through Multi-Head Self-Attention (MHSA), but they are weaker at modeling image locality and translation invariance, which hurts the accurate segmentation of small objects and boundaries.

### Solutions:

- **HiFiSeg framework**: Combine the PVT encoder with the GLIM and SAM modules to handle high-frequency information processing and fine-grained feature extraction.
- **GLIM module**: Fuse global and local information through multi-scale convolution kernels and parallel structures to extract fine-grained features.
- **SAM module**: Selectively fuse low-level boundary details and high-level semantic information to reduce boundary blurring and improve the model's detection and segmentation accuracy.

### Experimental results:

- **Quantitative evaluation**: Results on multiple datasets show that HiFiSeg outperforms other methods on metrics such as mDice, mIoU and MAE.
- **Qualitative evaluation**: Visualizations indicate that HiFiSeg captures small objects and boundary details well, accurately distinguishes colon tissue from polyps, and maintains stable segmentation under different imaging conditions.
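
The metrics named above can be sketched in a few lines. This is a minimal illustration using the standard definitions of Dice, IoU and mean absolute error, averaged over images; the 0.5 binarization threshold, the `eps` smoothing term, and the helper names (`dice_and_iou`, `mae`, `evaluate`) are assumptions made here, and the paper's exact evaluation protocol may differ.

```python
def flatten(mask):
    """Turn a 2D list of pixel values into a flat list."""
    return [v for row in mask for v in row]


def dice_and_iou(pred, gt, threshold=0.5, eps=1e-8):
    """Dice and IoU between a probability map and a binary ground-truth mask."""
    p = [1 if v >= threshold else 0 for v in flatten(pred)]  # assumed threshold
    g = [1 if v >= 0.5 else 0 for v in flatten(gt)]
    inter = sum(a & b for a, b in zip(p, g))
    p_sum, g_sum = sum(p), sum(g)
    union = p_sum + g_sum - inter
    dice = (2 * inter + eps) / (p_sum + g_sum + eps)
    iou = (inter + eps) / (union + eps)
    return dice, iou


def mae(pred, gt):
    """Mean absolute error between the raw prediction map and the mask."""
    p, g = flatten(pred), flatten(gt)
    return sum(abs(a - b) for a, b in zip(p, g)) / len(p)


def evaluate(pairs):
    """Average each metric over a list of (prediction, ground truth) pairs."""
    scores = [dice_and_iou(p, g) + (mae(p, g),) for p, g in pairs]
    n = len(scores)
    return {
        "mDice": sum(s[0] for s in scores) / n,
        "mIoU": sum(s[1] for s in scores) / n,
        "MAE": sum(s[2] for s in scores) / n,
    }


# Hypothetical 2x2 prediction and mask, for illustration only.
pred = [[0.9, 0.8], [0.2, 0.1]]
gt = [[1, 1], [1, 0]]
result = evaluate([(pred, gt)])
```

Higher mDice and mIoU indicate better overlap with the ground truth, while lower MAE indicates a prediction map closer to the mask pixel by pixel.
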
In conclusion, the paper tackles high-frequency information processing and fine-grained feature extraction in colon polyp segmentation by proposing the HiFiSeg network, offering a new approach to medical-image segmentation.
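
The idea of selectively fusing low-level boundary detail with high-level semantics, as described for SAM, can be illustrated with a toy gated fusion. This is a sketch under stated assumptions, not the paper's formulation, which this summary does not give: the sigmoid gate derived from the high-level feature, the additive combination, the nearest-neighbour upsampling, and the names `selective_fuse` and `upsample_nearest` are all hypothetical, and 1D lists stand in for feature maps.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def upsample_nearest(feat, factor):
    """Repeat each coarse value `factor` times to match the finer resolution."""
    return [v for v in feat for _ in range(factor)]


def selective_fuse(low, high, factor=2):
    """Toy gated fusion (assumed form): high-level semantics decide how much
    low-level detail passes through at each position."""
    high_up = upsample_nearest(high, factor)
    if len(high_up) != len(low):
        raise ValueError("upsampled high-level feature must match low-level length")
    fused = []
    for l, h in zip(low, high_up):
        gate = sigmoid(h)            # near 1 where semantics are confident, near 0 otherwise
        fused.append(h + gate * l)   # keep semantics, add gated boundary detail
    return fused


# Hypothetical features: fine-scale detail and a coarse semantic response.
low_level = [1.0, -1.0, 1.0, -1.0]
high_level = [10.0, -10.0]
fused = selective_fuse(low_level, high_level)
```

Where the coarse response is strongly positive, the fine-scale detail is passed through almost unchanged; where it is strongly negative, the detail is suppressed, which is one simple way to keep boundary information only where the semantics support it.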

HiFiSeg: High-Frequency Information Enhanced Polyp Segmentation with Global-Local Vision Transformer

Probabilistic Modeling Ensemble Vision Transformer Improves Complex Polyp Segmentation

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

HIGF-Net: Hierarchical Information-Guided Fusion Network for Polyp Segmentation Based on Transformer and Convolution Feature Learning

SegT: A Novel Separated Edge-guidance Transformer Network for Polyp Segmentation

RetSeg: Retention-based Colorectal Polyps Segmentation Network

Improving Polyp Segmentation with Boundary-Assisted Guidance and Cross-Scale Interaction Fusion Transformer Network

PolySegNet: improving polyp segmentation through swin transformer and vision transformer fusion

Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers

Utilizing adaptive deformable convolution and position embedding for colon polyp segmentation with a visual transformer

PolyPooling: An accurate polyp segmentation from colonoscopy images

FCN-Transformer Feature Fusion for Polyp Segmentation

Multi-Layer Dense Attention Decoder for Polyp Segmentation

Polyp Segmentation With the FCB-SwinV2 Transformer

UViT-Seg: An Efficient ViT and U-Net-Based Framework for Accurate Colorectal Polyp Segmentation in Colonoscopy and WCE Images

Polyp Segmentation Using a Hybrid Vision Transformer and a Hybrid Loss Function

NA-segformer: A multi-level transformer model based on neighborhood attention for colonoscopic polyp segmentation

ECTransNet: An Automatic Polyp Segmentation Network Based on Multi-scale Edge Complementary

CIFG-Net: Cross-level information fusion and guidance network for Polyp Segmentation

TMPSformer: An Efficient Hybrid Transformer-MLP Network for Polyp Segmentation

SAEFormer: stepwise attention emphasis transformer for polyp segmentation