Abstract: Transformers have been extensively studied in medical image segmentation for modeling pairwise long-range dependencies. Yet the relatively limited amount of well-annotated medical image data makes it difficult for transformers to extract diverse global features, resulting in attention collapse, where attention maps become similar or even identical. By comparison, convolutional neural networks (CNNs) converge better on small-scale training data but suffer from limited receptive fields. Existing works explore combinations of CNNs and transformers while ignoring attention collapse, leaving the potential of transformers under-explored. In this paper, we propose CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance. Specifically, ConvFormer consists of pooling, CNN-style self-attention (CSA), and a convolutional feed-forward network (CFFN), corresponding to tokenization, self-attention, and the feed-forward network in vanilla vision transformers, respectively. In place of positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling to both preserve position information and reduce feature size. CSA thus takes 2D feature maps as inputs and establishes long-range dependencies by constructing self-attention matrices as convolution kernels with adaptive sizes. Following CSA, CFFN applies 2D convolution for feature refinement. Experimental results on multiple datasets demonstrate the effectiveness of ConvFormer as a plug-and-play module that consistently improves the performance of transformer-based frameworks. Code is available at https://github.com/xianlin7/ConvFormer.
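
The notion of attention collapse described above can be made concrete with a small diagnostic. The sketch below is an illustration only, not the metric or code used by the ConvFormer authors: it assumes that collapse can be summarized as the mean pairwise cosine similarity between flattened attention maps, and the helper names (`attention_map`, `collapse_score`) and the toy score matrices are hypothetical.

```python
import math
from itertools import combinations


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(scores):
    """Row-wise softmax of an N x N score matrix, flattened to one vector."""
    return [p for row in scores for p in softmax(row)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def collapse_score(score_matrices):
    """Mean pairwise cosine similarity between attention maps.

    Assumed diagnostic (not from the paper): values near 1.0 mean the
    maps are nearly identical, i.e. attention has collapsed.
    """
    maps = [attention_map(s) for s in score_matrices]
    pairs = list(combinations(maps, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy pre-softmax scores for 3 tokens (hypothetical data).
# The "diverse" maps attend to different tokens; the "collapsed" maps
# share the same near-uniform pattern.
diverse = [
    [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]],
    [[0.0, 4.0, 0.0], [0.0, 0.0, 4.0], [4.0, 0.0, 0.0]],
]
collapsed = [
    [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]],
    [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]],
]

diverse_score = collapse_score(diverse)
collapsed_score = collapse_score(collapsed)
print(f"diverse:   {diverse_score:.3f}")
print(f"collapsed: {collapsed_score:.3f}")
```

Under this assumed measure, a score close to 1 across heads or layers would indicate the collapse the abstract refers to, while lower scores indicate more diverse attention maps.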

EAT: an Enhancer for Aesthetics-Oriented Transformers

EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

EAPT: Efficient Attention Pyramid Transformer for Image Processing

Improved EATFormer: A Vision Transformer for Medical Image Classification

How Powerful Potential of Attention on Image Restoration?

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

Edge-Aware Attention Transformer for Image Super-Resolution

Illumination Adaptive Transformer

Image aesthetics assessment using composite features from transformer and CNN

Hybrid CNN-Transformer based Meta-Learning Approach for Personalized Image Aesthetics Assessment

FAM: Improving columnar vision transformer with feature attention mechanism

Lite Vision Transformer with Enhanced Self-Attention

EAT: epipolar-aware Transformer for low-light light field enhancement

ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

ACC-ViT: Atrous Convolution's Comeback in Vision Transformers

CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction

Vision Transformer with Sparse Scan Prior

DAE-Former: Dual Attention-guided Efficient Transformer for Medical Image Segmentation

Data Augmentation Vision Transformer for Fine-grained Image Classification

DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition