Abstract: Vision Transformer (ViT) has emerged as a potential alternative to convolutional neural networks on large datasets. However, applying ViT directly to medical image segmentation is challenging: because it lacks inductive bias, it requires a large number of high-quality annotated medical images for effective training. Recent studies have found that, in addition to the increased model capacity and generalization resulting from the lack of inductive bias, the excellent performance of Transformers can also be attributed to their large receptive field. In this paper, we propose a U-shaped medical image segmentation model that combines large kernel convolutions with Transformers. Specifically, we construct a basic Transformer unit from a pyramid convolution module with multi-scale kernels and a multi-layer perceptron. In the pyramid convolution module, we employ grouped convolution to reduce the parameter count and computational complexity, and use multi-scale large kernel attention as a foundation for more efficient feature extraction. Different groupings use convolutions of different sizes, which enhances the extraction of features with multiple receptive fields. To refine the features extracted by the encoder, the U-shaped model integrates a variant of the pyramid convolution module into the skip connections. This variant uses multi-scale large kernel convolutional attention based on channel splitting and enables efficient refinement of the feature representations within the skip connections. In extensive comparisons on multi-modal medical image datasets, our model outperforms state-of-the-art methods across various evaluation metrics, with notable superiority on small-scale medical datasets. Our findings suggest that combining large kernel convolutions with Transformer models introduces an advantageous inductive bias, which improves performance specifically on small-scale medical image datasets. Our code is openly available on GitHub: https://github.com/medical-images-process/CNN-Transformer
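
To illustrate why grouped convolution keeps multi-scale large kernels affordable, the following minimal sketch counts the weights of a pyramid-style convolution and compares them with a single dense large-kernel convolution. It is an arithmetic illustration only, not the implementation in the repository: the helper names, the channel width of 64, the kernel sizes, the group counts, and the even split of output channels across levels are all assumptions chosen for the example.

```python
def conv2d_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 2D convolution (bias omitted).

    With grouped convolution each output channel only sees
    c_in / groups input channels, so the count shrinks by `groups`.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return (c_in // groups) * c_out * kernel * kernel


def pyramid_conv_params(c_in, c_out, levels):
    """Weight count of a pyramid-style convolution.

    `levels` is a list of (kernel, groups) pairs. As an assumption of
    this sketch, the output channels are split evenly across levels.
    """
    if c_out % len(levels):
        raise ValueError("c_out must be divisible by the number of levels")
    c_level = c_out // len(levels)
    return sum(conv2d_params(c_in, c_level, k, g) for k, g in levels)


# Hypothetical configuration: larger kernels are paired with more groups.
levels = [(3, 1), (5, 4), (7, 8), (9, 16)]
channels = 64

dense_9x9 = conv2d_params(channels, channels, kernel=9)
pyramid = pyramid_conv_params(channels, channels, levels)

print("dense 9x9 weights:", dense_9x9)   # 331776
print("pyramid weights:  ", pyramid)     # 27072
print("reduction factor: ", round(dense_9x9 / pyramid, 2))
```

Under these assumed settings the pyramid layout covers receptive fields from 3x3 up to 9x9 with roughly a twelfth of the weights of one dense 9x9 convolution, which is the kind of trade-off the abstract attributes to grouped convolution.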

Transformer-Based Disease Identification for Small-Scale Imbalanced Capsule Endoscopy Dataset

SatFormer: Saliency-Guided Abnormality-Aware Transformer for Retinal Disease Classification in Fundus Image

ViTCA-Net: a framework for disease detection in video capsule endoscopy images using a vision transformer and convolutional neural network with a specific attention mechanism

Vision Transformer for Efficient Chest X-ray and Gastrointestinal Image Classification

Research and implementation of multi-disease diagnosis on chest X-ray based on vision transformer

Hybrid CNN-Transformer model for medical image segmentation with pyramid convolution and multi-layer perceptron

Gastrointestinal Disorder Detection with a Transformer Based Approach

Pathological Insights: Enhanced Vision Transformers for the Early Detection of Colorectal Cancer

MIL-ViT: A Multiple Instance Vision Transformer for Fundus Image Classification

Vision transformer with masked autoencoders for referable diabetic retinopathy classification based on large-size retina image

Implementing vision transformer for classifying 2D biomedical images

Multi-label classification of retinal disease via a novel vision transformer model

Data-Efficient Vision Transformers for Multi-Label Disease Classification on Chest Radiographs

Classification of Endoscopy and Video Capsule Images using CNN-Transformer Model

Utilizing adaptive deformable convolution and position embedding for colon polyp segmentation with a visual transformer

A vision transformer for emphysema classification using CT images

Vision Transformers for Small Histological Datasets Learned through Knowledge Distillation

Multi-Class Abnormality Classification Task in Video Capsule Endoscopy

Application of Vision-Series Transformer in Screening for Coronary Heart Diseases Using Coronary CT Angiography

A Deep Learning Application of Capsule Endoscopic Gastric Structure Recognition Based on a Transformer Model

Anatomical sites identification in both ordinary and capsule gastroduodenoscopy via deep learning