DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention

Xiaoya Tang, Bodong Zhang, Beatrice S. Knudsen, Tolga Tasdizen
2024-07-19
Abstract: We propose a novel hierarchical transformer model that integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the representational potential of Vision Transformers (ViTs). To address the lack of inductive biases in ViTs and their dependence on extensive training datasets, our model employs a CNN backbone to generate hierarchical visual representations. These representations are then adapted for transformer input through a new patch tokenization scheme. We also introduce a 'scale attention' mechanism that captures cross-scale dependencies, complementing patch attention to enhance spatial understanding and preserve global perception. Our approach significantly outperforms baseline models on small and medium-sized medical datasets, demonstrating its efficiency and generalizability. The components are designed as plug-and-play for different CNN architectures and can be adapted for multiple applications. The code is available at https://github.com/xiaoyatang/DuoFormer.git.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses two weaknesses of Vision Transformers (ViTs) on medical images: their lack of inductive bias and their reliance on large-scale training data. It proposes a hierarchical transformer model called DuoFormer, which combines the advantages of Convolutional Neural Networks (CNNs) and Vision Transformers to improve the model's ability to capture multi-scale features. The main contributions are as follows:

1. **Multi-scale Tokenization**: Assembling multi-scale features from different stages of a CNN through single-layer projection, patch indexing, and concatenation.
2. **Dual Attention Mechanism**: Introducing a new mechanism called "scale attention" and combining it with patch attention, enabling the model to recognize cross-scale connections, expand the receptive field of the ViT, and bridge the gap between CNN and transformer architectures.
3. **Scale Token**: Initializing a fused embedding, the scale token, to aggregate scale information and serve as the input for global patch attention.

Experimental results show that DuoFormer significantly outperforms baseline models on small and medium-sized medical datasets, demonstrating its efficiency and generalization capability. The components are also designed to be used with different CNN backbones, which makes the model flexible.
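The interplay between scale attention, the scale token, and patch attention described above can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that multi-scale tokenization has already produced, for each patch, one embedding per CNN stage. It uses single-head attention with identity query/key/value projections instead of learned ones, and it keeps a single scale-level token per patch and stage. The names `self_attention`, `dual_attention`, `patch_scale_tokens`, and `scale_token` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a real model would learn these projections.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def dual_attention(patch_scale_tokens, scale_token):
    """Scale attention within each patch, then patch attention across patches.

    patch_scale_tokens[p][s] is the embedding of patch p at scale s.
    scale_token is prepended to each patch's scale sequence; its updated
    value is taken as that patch's fused, cross-scale representation.
    """
    fused = []
    for scales in patch_scale_tokens:
        attended = self_attention([scale_token] + scales)
        fused.append(attended[0])  # updated scale token summarises this patch
    # Global patch attention over the fused per-patch embeddings.
    return self_attention(fused)


# Toy input: 4 patches, 3 scales, embedding dimension 2.
tokens = [[[float(p + s), float(p - s)] for s in range(3)] for p in range(4)]
output = dual_attention(tokens, scale_token=[0.0, 0.0])
```

Here `output` holds one embedding per patch. Scale attention runs over a short sequence (the number of scales plus one) for each patch, while patch attention runs once over all patches, so the two steps mix information locally across scales and then globally across space.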