Abstract: The Vision Transformer (ViT) has emerged as a potential alternative to convolutional neural networks on large datasets. However, applying ViT directly to medical image segmentation is challenging: because it lacks inductive bias, it requires a large number of high-quality annotated medical images for effective training. Recent studies have found that, in addition to the increased model capacity and generalization resulting from the lack of inductive bias, the strong performance of Transformers can also be attributed to their large receptive field. In this paper, we propose a U-shaped medical image segmentation model that combines large kernel convolutions with Transformers. Specifically, we construct a basic Transformer unit from a pyramid convolution module with multi-scale kernels and a multi-layer perceptron. In the pyramid convolution module, we employ grouped convolution to reduce the parameter count and computational cost, and use multi-scale large kernel attention as the foundation for more efficient feature extraction. Different kernel sizes are used for different types of grouping, which enhances the extraction of features with multiple receptive fields. To refine the features extracted by the encoder, the U-shaped model integrates a variant of the pyramid convolution module into the skip connections. This variant uses multi-scale large kernel convolutional attention based on channel splitting and enables efficient refinement of the feature representations within the skip connections. In extensive comparisons on multi-modal medical image datasets, our model outperforms state-of-the-art methods across various evaluation metrics, with notable superiority on small-scale medical datasets. Our findings suggest that combining large kernel convolutions with Transformer models introduces an advantageous inductive bias, resulting in enhanced performance specifically on small-scale medical image datasets. Our code is openly available on GitHub at https://github.com/medical-images-process/CNN-Transformer.
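
To make the parameter-saving argument for grouped, multi-scale convolution concrete, the following is a minimal Python sketch that counts convolution weights. It compares a single dense large-kernel convolution with a set of parallel grouped branches, each using its own kernel size. The channel count (64), the kernel sizes (3, 5, 7, 9), the group counts (1, 4, 8, 16), and the helper names are illustrative assumptions and are not taken from the paper; bias terms are ignored.

```python
def conv2d_params(in_channels, out_channels, kernel_size, groups=1):
    """Weight count of a 2D convolution with square kernels (bias ignored).

    With grouped convolution each output channel only sees
    in_channels / groups input channels, so the count shrinks by `groups`.
    """
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (in_channels // groups) * out_channels * kernel_size * kernel_size


def pyramid_params(in_channels, out_channels, kernel_sizes, groups_per_branch):
    """Total weights of parallel branches whose outputs are concatenated.

    Each branch produces an equal share of the output channels and uses
    its own kernel size and group count (hypothetical layout).
    """
    if out_channels % len(kernel_sizes):
        raise ValueError("out_channels must split evenly across branches")
    branch_out = out_channels // len(kernel_sizes)
    return sum(
        conv2d_params(in_channels, branch_out, k, g)
        for k, g in zip(kernel_sizes, groups_per_branch)
    )


# Illustrative values only (assumptions, not the paper's configuration).
CHANNELS = 64
KERNELS = (3, 5, 7, 9)
GROUPS = (1, 4, 8, 16)  # larger kernels get more groups to stay cheap

dense = conv2d_params(CHANNELS, CHANNELS, max(KERNELS))
pyramid = pyramid_params(CHANNELS, CHANNELS, KERNELS, GROUPS)
print(f"dense 9x9 conv weights:    {dense}")
print(f"grouped pyramid weights:   {pyramid}")
print(f"reduction factor:          {dense / pyramid:.1f}x")
```

Under these assumed settings, the grouped multi-scale branches cover receptive fields from 3x3 up to 9x9 while using far fewer weights than one dense 9x9 convolution, which is the trade-off the abstract describes.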
