DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention

Xiaoya Tang, Bodong Zhang, Beatrice S. Knudsen, Tolga Tasdizen

2024-07-19

Abstract: We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs). Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations. These representations are then adapted for transformer input through an innovative patch tokenization. We also introduce a 'scale attention' mechanism that captures cross-scale dependencies, complementing patch attention to enhance spatial understanding and preserve global perception. Our approach significantly outperforms baseline models on small and medium-sized medical datasets, demonstrating its efficiency and generalizability. The components are designed as plug-and-play for different CNN architectures and can be adapted for multiple applications. The code is available at https://github.com/xiaoyatang/DuoFormer.git.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses two weaknesses of Vision Transformers (ViTs) on medical images: their lack of inductive bias and their reliance on large-scale training data. It proposes DuoFormer, a hierarchical transformer model that combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers to better capture multi-scale features. The main contributions are as follows:

1. **Multi-scale Tokenization**: Assembling multi-scale features from different stages of a CNN through single-layer projection, patch indexing, and concatenation.
2. **Dual Attention Mechanism**: Introducing a new mechanism called "scale attention" and combining it with patch attention, so that the model can recognize cross-scale connections, expand the receptive field of the ViT, and bridge the gap between CNN and transformer architectures.
3. **Scale Token**: Initializing a fused embedding scale token that aggregates scale information and serves as the input for global patch attention.

Experimental results show that DuoFormer significantly outperforms baseline models on small and medium-sized medical datasets, demonstrating its efficiency and generalization capability. The components are also designed to work with different CNN backbones, which makes the approach flexible.
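The three contributions listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that): the stage shapes, the 7x7 patch grid, the embedding width, the single-head attention, and the random untrained weights are all assumptions chosen only to make the data flow visible.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, dim):
    """Single-head scaled dot-product self-attention with random (untrained) weights."""
    w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(dim)
    return softmax(scores) @ v

def multiscale_tokens(feature_maps, projections, grid):
    """Project each stage to a shared width, then group features by patch location.

    Returns an array of shape (grid * grid, tokens_per_patch, dim).
    """
    per_patch = []
    for fmap, w in zip(feature_maps, projections):
        side = fmap.shape[0]          # assumes square maps with side divisible by grid
        r = side // grid              # positions per patch along one axis at this scale
        proj = fmap @ w               # single-layer projection: (side, side, dim)
        d = proj.shape[-1]
        # patch indexing: split the map into grid x grid blocks of r x r positions
        blocks = proj.reshape(grid, r, grid, r, d).transpose(0, 2, 1, 3, 4)
        per_patch.append(blocks.reshape(grid * grid, r * r, d))
    return np.concatenate(per_patch, axis=1)  # concatenate scales within each patch

dim, grid = 16, 7
channels = (32, 64, 128)
# Hypothetical CNN stage outputs, laid out as (height, width, channels).
stages = [rng.normal(size=(s, s, c)) for s, c in zip((28, 14, 7), channels)]
projections = [rng.normal(size=(c, dim)) / np.sqrt(c) for c in channels]

tokens = multiscale_tokens(stages, projections, grid)   # (49, 16 + 4 + 1, 16)

# Scale token: one extra embedding per patch (learned in a real model, random here).
scale_token = np.tile(rng.normal(size=(1, 1, dim)), (grid * grid, 1, 1))
x = np.concatenate([scale_token, tokens], axis=1)       # (49, 22, 16)

# Scale attention: tokens attend across scales within the same patch.
x = x + self_attention(x, dim)

# Patch attention: the per-patch scale tokens attend to each other across the image.
patch_in = x[:, 0, :]                                   # (49, 16)
patch_out = patch_in + self_attention(patch_in, dim)    # (49, 16)

print(tokens.shape, x.shape, patch_out.shape)
```

In this sketch, scale attention runs independently for each of the 49 patches over 22 tokens, while patch attention runs once over the 49 scale tokens, so the global step only sees the information that the scale token aggregated locally.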

ScaleFormer: Revisiting the Transformer-based Backbones from a Scale-wise Perspective for Medical Image Segmentation.

Convolutional Embedding Makes Hierarchical Vision Transformer Stronger

ConvFormer: Plug-and-Play CNN-Style Transformers for Improving Medical Image Segmentation

Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets

ASAFormer: Visual tracking with convolutional vision transformer and asymmetric selective attention

H2Former: An Efficient Hierarchical Hybrid Transformer for Medical Image Segmentation

Dual-Dependency Attention Transformer for Fine-Grained Visual Classification

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

RegionViT: Regional-to-Local Attention for Vision Transformers

Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets

A novel dual-granularity lightweight transformer for vision tasks

DDViT: Double-Level Fusion Domain Adapter Vision Transformer (Student Abstract)

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

Vision Transformers with Hierarchical Attention

Feature‐enhanced representation with transformers for multi‐view stereo

Deep Hierarchical Vision Transformer for Hyperspectral and LiDAR Data Classification

HD-Former: A hierarchical dependency Transformer for medical image segmentation

Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention

HA-Transformer: Harmonious aggregation from local to global for object detection

DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition