Abstract: In this study, we introduce FilterViT, an enhanced version of MobileViT, which leverages an attention-based mechanism for early-stage downsampling. Traditional QKV operations on high-resolution feature maps are computationally intensive due to the abundance of tokens. To address this, we propose a filter attention mechanism using a convolutional neural network (CNN) to generate an importance mask, focusing attention on key image regions. The method significantly reduces computational complexity while maintaining interpretability, as it highlights essential image areas. Experimental results show that FilterViT achieves substantial gains in both efficiency and accuracy compared to other models. We also introduce DropoutViT, a variant that uses a stochastic approach for pixel selection, further enhancing robustness.
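To make the token-count argument concrete, here is a rough back-of-the-envelope sketch of how many query-key pairs a single attention head scores when every pixel is a token, compared with keeping only a fraction of them. The 64×64 feature map size and the `keep_ratio` parameter are illustrative assumptions, not values taken from the paper.

```python
def attention_pair_count(height: int, width: int, keep_ratio: float = 1.0) -> int:
    """Number of query-key pairs scored by one self-attention head.

    keep_ratio is a hypothetical parameter: the fraction of pixels kept as tokens.
    """
    tokens = int(height * width * keep_ratio)
    # Every token attends to every token, so the count is quadratic in tokens.
    return tokens * tokens


# Assumed example: a 64x64 feature map, first with all pixels, then with 25% kept.
full = attention_pair_count(64, 64)            # 4096 tokens
filtered = attention_pair_count(64, 64, 0.25)  # 1024 tokens
print(full, filtered, full // filtered)
```

Because the cost is quadratic in the number of tokens, keeping a quarter of the pixels cuts the number of pairwise scores by a factor of sixteen in this example.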

What problem does this paper attempt to address?

This paper attempts to address the high computational complexity of Vision Transformer (ViT) models when processing high-resolution images. Specifically, during the QKV operations, ViT consumes a significant amount of computational resources due to the large size of high-resolution feature maps and the vast number of tokens. To solve this problem, the authors propose an improved MobileViT variant, FilterViT.

### Main Issues:

1. **High Computational Complexity**: The cost of the QKV operations in ViT grows quadratically with the number of tokens, so high-resolution images lead to enormous consumption of computational resources.
2. **Information Redundancy**: Not all pixels contribute equally to the final prediction. Many pixels are noise or irrelevant, while key pixels are crucial for decision-making.
3. **Efficiency of Attention Mechanism**: Traditional attention mechanisms are inefficient when processing high-resolution images due to the need to handle a large number of tokens.

### Solutions:

1. **Introduction of Filter Attention Mechanism**: By using a Convolutional Neural Network (CNN) to generate an importance mask (Filter Mask), the most critical pixels in the feature map are selected for attention computation. This significantly reduces the number of tokens involved in the attention computation, thereby lowering computational complexity.
2. **Selective Attention**: By scoring the pixels in the feature map and selecting the top K pixels, the model can focus on the most relevant areas of the image, improving computational efficiency and accuracy.
3. **Interpretability**: The importance mask not only reduces the computational burden but also provides model interpretability, as the key areas the model focuses on can be visualized through the mask.

### Contributions:

1. **Lightweight and Efficient Attention Mechanism**: The filter attention mechanism enables fine-grained attention computation while reducing computational complexity.
2. **Interpretability**: The model can highlight the most important parts of the image, enhancing the model's interpretability.
3. **DropoutViT Variant**: By randomly selecting pixels for attention computation, the flexibility and robustness of the method are further validated.

In summary, this paper addresses the computational efficiency issue of ViT in high-resolution image processing by introducing the filter attention mechanism, while maintaining high accuracy and model interpretability.
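The selection step described above (score the pixels, keep the top K, attend only over those) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the importance scores come from a random 1×1 projection standing in for the CNN, unselected pixels are assumed to pass through unchanged, and the function name `filter_attention` and all shapes are hypothetical. Passing a random generator switches to uniform random selection, in the spirit of the DropoutViT variant.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def filter_attention(feat, scores, k, wq, wk, wv, rng=None):
    """Self-attention restricted to k selected pixels (hypothetical helper).

    feat:   (H, W, C) feature map
    scores: (H, W) importance mask (in the paper this comes from a CNN)
    rng:    if given, pick k pixels uniformly at random instead of top-k
            (a DropoutViT-style selection; an assumption of this sketch)
    Returns the updated feature map and the flat indices of the selected pixels.
    """
    h, w, c = feat.shape
    tokens = feat.reshape(h * w, c)
    if rng is None:
        idx = np.argsort(scores.reshape(-1))[-k:]       # k highest-scoring pixels
    else:
        idx = rng.choice(h * w, size=k, replace=False)  # random pixels
    sel = tokens[idx]                                   # (k, C)
    q, key, v = sel @ wq, sel @ wk, sel @ wv
    attn = softmax(q @ key.T / np.sqrt(c))              # (k, k) instead of (HW, HW)
    out = tokens.copy()                                 # assumption: others unchanged
    out[idx] = attn @ v                                 # write attended tokens back
    return out.reshape(h, w, c), idx


rng = np.random.default_rng(0)
H, W, C, K = 8, 8, 16, 16
feat = rng.standard_normal((H, W, C))
# Stand-in for the CNN mask generator: a random 1x1 projection plus sigmoid.
w_score = rng.standard_normal(C)
scores = 1.0 / (1.0 + np.exp(-(feat @ w_score)))        # (H, W)
wq, wk, wv = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

out, idx = filter_attention(feat, scores, K, wq, wk, wv)
out_rand, idx_rand = filter_attention(feat, scores, K, wq, wk, wv, rng=rng)
print(out.shape, sorted(idx.tolist()))
```

Reshaping `scores` to `(H, W)` and plotting it would give the kind of mask visualization that the interpretability claim refers to.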

FilterViT and DropoutViT: Lightweight Vision Transformer Models for Efficient Attention Mechanisms
