Abstract: Transformers have been extensively studied in medical image segmentation to build pairwise long-range dependence. Yet the relatively limited amount of well-annotated medical image data makes it hard for transformers to extract diverse global features, resulting in attention collapse, where attention maps become similar or even identical. By comparison, convolutional neural networks (CNNs) have better convergence properties on small-scale training data but suffer from limited receptive fields. Existing works explore combinations of CNNs and transformers while ignoring attention collapse, leaving the potential of transformers under-explored. In this paper, we propose to build CNN-style Transformers (ConvFormer) to promote better attention convergence and thus better segmentation performance. Specifically, ConvFormer consists of pooling, CNN-style self-attention (CSA), and a convolutional feed-forward network (CFFN), corresponding to tokenization, self-attention, and the feed-forward network in vanilla vision transformers. In contrast to positional embedding and tokenization, ConvFormer adopts 2D convolution and max-pooling for both position information preservation and feature size reduction. In this way, CSA takes 2D feature maps as inputs and establishes long-range dependency by constructing self-attention matrices as convolution kernels with adaptive sizes. Following CSA, 2D convolution is utilized for feature refinement through CFFN. Experimental results on multiple datasets demonstrate the effectiveness of ConvFormer working as a plug-and-play module for consistent performance improvement of transformer-based frameworks. Code is available at https://github.com/xianlin7/ConvFormer.
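
The notion of attention collapse ("attention maps become similar or even identical") can be made concrete with a small diagnostic. The sketch below is only an illustration and is not taken from the paper or its repository: it assumes attention maps are given as row-normalized matrices and scores collapse as the mean pairwise cosine similarity between flattened maps. The function names and the toy score matrices are hypothetical.

```python
import math
from itertools import combinations


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equally sized flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mean_pairwise_similarity(attention_maps):
    """Average cosine similarity over all pairs of flattened attention maps.

    Values near 1.0 mean the maps are nearly identical (collapsed).
    This metric is an assumption made for illustration only.
    """
    flat = [[v for row in m for v in row] for m in attention_maps]
    pairs = list(combinations(flat, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy scores for 3 tokens; each row is softmax-normalized.
diverse = [
    [softmax(r) for r in [[4, 0, 0], [0, 4, 0], [0, 0, 4]]],
    [softmax(r) for r in [[0, 4, 0], [0, 0, 4], [4, 0, 0]]],
]
collapsed = [
    [softmax(r) for r in [[1, 1, 1], [1, 1, 1], [1, 1, 1]]],
    [softmax(r) for r in [[1.1, 1, 1], [1, 1.1, 1], [1, 1, 1.1]]],
]

diverse_score = mean_pairwise_similarity(diverse)
collapsed_score = mean_pairwise_similarity(collapsed)
```

In this toy setting the near-uniform maps score much closer to 1 than the maps that attend to different tokens, which is the pattern the abstract describes as collapse.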

Learning confidence measure with transformer in stereo matching

Consformer: Consciousness Detection Using Transformer Networks With Correntropy-Based Measures

A Transformer-Based Architecture for High-Resolution Stereo Matching

End-to-end information fusion method for transformer-based stereo matching

Modeling Stereo-Confidence Out of the End-to-End Stereo-Matching Network via Disparity Plane Sweep

UniTT-Stereo: Unified Training of Transformer for Enhanced Stereo Matching

Conformer: Local Features Coupling Global Representations for Visual Recognition

ChiTransformer: Towards Reliable Stereo from Cues

Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints

Playing to Vision Foundation Model's Strengths in Stereo Matching

Feature‐enhanced representation with transformers for multi‐view stereo

ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

CNN-based Cost Volume Analysis as Confidence Measure for Dense Matching

Improving Transformer-based Networks With Locality For Automatic Speaker Verification

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

Shape-Former: Bridging CNN and Transformer via ShapeConv for multimodal image matching

Conformer: Local Features Coupling Global Representations for Recognition and Detection

Simultaneous Stereo Matching and Confidence Estimation Network

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

HLocalExp-CM: confidence map by hierarchical local expansion moves for accurate stereo matching