FilterViT and DropoutViT: Lightweight Vision Transformer Models for Efficient Attention Mechanisms

Bohang Sun
2024-10-30
Abstract: In this study, we introduce FilterViT, an enhanced version of MobileViT, which leverages an attention-based mechanism for early-stage downsampling. Traditional QKV operations on high-resolution feature maps are computationally intensive due to the abundance of tokens. To address this, we propose a filter attention mechanism using a convolutional neural network (CNN) to generate an importance mask, focusing attention on key image regions. The method significantly reduces computational complexity while maintaining interpretability, as it highlights essential image areas. Experimental results show that FilterViT achieves substantial gains in both efficiency and accuracy compared to other models. We also introduce DropoutViT, a variant that uses a stochastic approach for pixel selection, further enhancing robustness.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper attempts to address the high computational complexity of Vision Transformer (ViT) models when processing high-resolution images. Specifically, the QKV operations consume significant computational resources because high-resolution feature maps produce a very large number of tokens. To solve this problem, the authors propose FilterViT, an improved MobileViT variant.

### Main Issues:

1. **High Computational Complexity**: The cost of the QKV operations in ViT grows quadratically with the number of tokens, so high-resolution images consume enormous computational resources.
2. **Information Redundancy**: Not all pixels contribute equally to the final prediction. Many pixels are noise or irrelevant, while a few key pixels are crucial for decision-making.
3. **Efficiency of Attention Mechanism**: Traditional attention mechanisms are inefficient on high-resolution images because they must handle every token.

### Solutions:

1. **Introduction of Filter Attention Mechanism**: A convolutional neural network (CNN) generates an importance mask (Filter Mask), and only the most critical pixels in the feature map are selected for attention computation. This significantly reduces the number of tokens involved in attention, thereby lowering computational complexity.
2. **Selective Attention**: By scoring the pixels in the feature map and selecting the top K, the model focuses on the most relevant areas of the image, improving computational efficiency and accuracy.
3. **Interpretability**: The importance mask not only reduces the computational burden but also makes the model interpretable, since the areas the model focuses on can be visualized through the mask.

### Contributions:

1. **Lightweight and Efficient Attention Mechanism**: The filter attention mechanism enables fine-grained attention computation while reducing computational complexity.
2. **Interpretability**: The model can highlight the most important parts of the image, enhancing its interpretability.
3. **DropoutViT Variant**: By randomly selecting pixels for attention computation, this variant further validates the flexibility and robustness of the method.

In summary, this paper addresses the computational efficiency of ViT in high-resolution image processing by introducing the filter attention mechanism, while maintaining high accuracy and model interpretability.
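The selection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the importance scores come from a hypothetical per-pixel linear projection standing in for the CNN-generated mask, the helper names (`select_top_k`, `select_random`, `sparse_self_attention`) are invented for illustration, and passing unselected pixels through unchanged is an assumption, since the summary does not say how they are handled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def select_top_k(scores, k):
    """FilterViT-style selection: indices of the k highest-scoring pixels."""
    return np.sort(np.argpartition(scores, -k)[-k:])

def select_random(num_tokens, k, rng):
    """DropoutViT-style selection: k pixel indices drawn uniformly at random."""
    return np.sort(rng.choice(num_tokens, size=k, replace=False))

def sparse_self_attention(tokens, idx, w_q, w_k, w_v):
    """Self-attention restricted to the selected tokens.

    Assumption: unselected tokens are copied through unchanged.
    """
    sel = tokens[idx]                                   # (k, C)
    q, key, v = sel @ w_q, sel @ w_k, sel @ w_v
    attn = softmax(q @ key.T / np.sqrt(q.shape[-1]))    # (k, k) instead of (N, N)
    out = tokens.copy()
    out[idx] = attn @ v
    return out

rng = np.random.default_rng(0)
H, W, C, K = 8, 8, 16, 12                # toy sizes, chosen for illustration
tokens = rng.standard_normal((H, W, C)).reshape(H * W, C)

# Hypothetical stand-in for the CNN importance mask: projection + sigmoid.
w_mask = rng.standard_normal(C)
scores = 1.0 / (1.0 + np.exp(-(tokens @ w_mask)))

w_q, w_k, w_v = (rng.standard_normal((C, C)) / np.sqrt(C) for _ in range(3))

idx_filter = select_top_k(scores, K)
out_filter = sparse_self_attention(tokens, idx_filter, w_q, w_k, w_v)

idx_random = select_random(H * W, K, rng)
out_random = sparse_self_attention(tokens, idx_random, w_q, w_k, w_v)
```

With these toy sizes the attention matrix shrinks from 64 × 64 to 12 × 12, which is the source of the savings: the quadratic term depends on K rather than on the full token count. Reshaping `scores` to `(H, W)` gives a map that can be plotted to see which regions were selected.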