Abstract: Vision Transformer (ViT) has emerged as a potential alternative to convolutional neural networks for large datasets. However, applying ViT directly to medical image segmentation is challenging because it lacks inductive bias and therefore requires a large number of high-quality annotated medical images for effective training. Recent studies have found that, in addition to the increased model capacity and generalization resulting from the lack of inductive bias, the excellent performance of the Transformer can also be attributed to its large receptive field. In this paper, we propose a U-shaped medical image segmentation model that combines large kernel convolutions with Transformers. Specifically, we construct a basic Transformer unit from a pyramid convolution module with multi-scale kernels and a multi-layer perceptron. In the pyramid convolution module, we employ grouped convolution to reduce the parameter count and computational complexity, while using multi-scale large kernel attention as a foundation for more efficient feature extraction. For different types of grouping, convolutions of different sizes are used to enhance the extraction of features with multiple receptive fields. To refine the features extracted by the encoder, the U-shaped model integrates a variant of the pyramid convolution module into the skip connections. This variant uses multi-scale large kernel convolutional attention based on channel splitting, which enables efficient refinement of the feature representations within the skip connections. In extensive comparisons on multi-modal medical image datasets, our model outperforms state-of-the-art methods across various evaluation metrics, with notable superiority on small-scale medical datasets.
Our findings suggest that combining large kernel convolutions with Transformer models introduces an advantageous inductive bias, resulting in enhanced performance specifically on small-scale medical image datasets. Our code is openly available in our GitHub repository at https://github.com/medical-images-process/CNN-Transformer .
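The claim that grouped convolution reduces the parameter count in a multi-scale pyramid can be illustrated with a short calculation. The sketch below is not taken from the authors' code. The channel counts, kernel sizes, and group counts are hypothetical values chosen for illustration, and the helper names `conv2d_params` and `pyramid_params` are assumptions. It relies only on the standard formula for the weights of a 2D convolution, `c_out * (c_in / groups) * k * k`.

```python
def conv2d_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 2D convolution with a square kernel and no bias."""
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("c_in and c_out must both be divisible by groups")
    # Each output channel only sees c_in // groups input channels.
    return c_out * (c_in // groups) * kernel * kernel


def pyramid_params(c_in, c_out_per_branch, branches):
    """Total weights of parallel branches given as (kernel, groups) pairs."""
    return sum(
        conv2d_params(c_in, c_out_per_branch, kernel, groups)
        for kernel, groups in branches
    )


# Hypothetical configuration (an assumption, not the paper's settings):
# four parallel branches, each mapping 64 input channels to 16 output
# channels, with larger kernels paired with more groups.
C_IN = 64
C_OUT_PER_BRANCH = 16
GROUPED_BRANCHES = [(3, 1), (5, 4), (7, 8), (9, 16)]
# Same kernel sizes without grouping, for comparison.
DENSE_BRANCHES = [(kernel, 1) for kernel, _ in GROUPED_BRANCHES]

grouped_total = pyramid_params(C_IN, C_OUT_PER_BRANCH, GROUPED_BRANCHES)
dense_total = pyramid_params(C_IN, C_OUT_PER_BRANCH, DENSE_BRANCHES)
# A single ungrouped 9x9 convolution producing all 64 output channels.
single_large_kernel = conv2d_params(C_IN, 64, 9)
```

Under these assumed settings the grouped pyramid has 27,072 weights, the same pyramid without grouping has 167,936, and a single dense 9x9 layer has 331,776. The large kernels become affordable mainly because their branches use more groups.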

Dual Transformer Encoder Model for Medical Image Classification

TransMed: Transformers Advance Multi-Modal Medical Image Classification

MedViT: A robust vision transformer for generalized medical image classification

DBCvT: Double Branch Convolutional Transformer for Medical Image Classification

Hybrid CNN-Transformer model for medical image segmentation with pyramid convolution and multi-layer perceptron

Token labeling-guided multi-scale medical image classification

Multi-label classification of retinal disease via a novel vision transformer model

LC2R-ViT: Long-Range Cross-Residual Vision Transformer for Medical Image Classification

Vision Transformer for Efficient Chest X-ray and Gastrointestinal Image Classification

Multi-Modal Fusion Transformer for Multivariate Time Series Classification

Cross Attention Multi Scale CNN-Transformer Hybrid Encoder is General Medical Image Learner

Cats: Complementary CNN and Transformer Encoders for Segmentation

DS-Former: A Dual-Stream Encoding-Based Transformer for 3D Medical Image Segmentation

Improved EATFormer: A Vision Transformer for Medical Image Classification

ClassFormer: Exploring Class-Aware Dependency with Transformer for Medical Image Segmentation

Implementing vision transformer for classifying 2D biomedical images

Using Vision Transformers in 3-D Medical Image Classifications

Dual encoder network with transformer-CNN for multi-organ segmentation

A New Perspective to Boost Vision Transformer for Medical Image Classification

An Arrhythmia Classification Model Based on Vision Transformer with Deformable Attention

Distance Restricted Transformer Encoder for Multi-Label Classification