What problem does this paper attempt to address?

This paper aims to improve the performance of lightweight vision models on computer vision tasks such as image classification, object detection, and semantic segmentation. Specifically, the authors propose Alternating Fourier and Image-Domain Adaptive Filtering (AFIDAF) as a replacement for the computationally intensive attention mechanism, and use it to build an efficient visual backbone network.

### Main problems:

1. **Reduce computational complexity**: Although existing Vision Transformers (ViTs) outperform Convolutional Neural Networks (CNNs), their computational cost is relatively high, especially in the attention mechanism. This makes them difficult to deploy in resource-constrained environments such as mobile devices.
2. **Improve the performance of lightweight models**: Although existing lightweight models (such as MobileNets and ShuffleNets) have fewer parameters, they fall short of large-scale models on some tasks. Improving their accuracy while keeping them lightweight is therefore an important research direction.
3. **Address the limitations of AFFNet**: AFFNet (Adaptive Frequency Filters as Efficient Global Token Mixers) performs global feature mixing through the Fourier transform. In practice, however, its channel-dimension filtering limits its performance in the frequency domain, resulting in poor performance on high-resolution or dense prediction tasks.

### Solutions proposed in the paper:

- **Alternating Fourier and Image-Domain Adaptive Filtering (AFIDAF)**: By alternating adaptive filtering between the Fourier domain and the image domain, the method combines the advantages of large-kernel convolution and the Fourier transform. It can extract features locally and also mix features effectively at a global scale, improving performance while keeping the model lightweight.
- **Hierarchical AFIDAF (HAFIDAF)**: To further compress the ViT model, the authors propose a hierarchical AFIDAF framework based on Swin Transformer. It reduces the number of parameters while maintaining high performance on image classification, object detection, and semantic segmentation.

### Experimental results:

- **Image classification**: AFIDAF achieves a Top-1 accuracy of 80.9% on the ImageNet-1K dataset with only 6.5M parameters and 1.5G FLOPs, which is better than other lightweight models.
- **Object detection**: On the MS-COCO dataset, AFIDAF reaches an mAP of 30.2%, which is significantly better than other lightweight detectors.
- **Semantic segmentation**: On the PASCAL VOC 2012 dataset, AFIDAF reaches an mIoU of 81.6%, which is also better than other lightweight models.

In conclusion, by introducing AFIDAF, the paper improves the trade-off between accuracy and computational efficiency for lightweight vision models and achieves strong results on multiple vision tasks.
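The idea of alternating between Fourier-domain and image-domain filtering, as summarized above, can be illustrated with a toy sketch. This is not the authors' implementation: it works on a 1D signal rather than image feature maps, uses a fixed low-pass mask and a fixed smoothing kernel where the paper's filters are adaptive, and all function names (`dft`, `idft`, `fourier_filter`, `local_filter`, `alternating_block`) are hypothetical. It only shows why a per-frequency multiplication mixes information globally while a small spatial kernel mixes it locally.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a sequence (O(n^2), for illustration)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def fourier_filter(x, mask):
    """Fourier-domain step: scale each frequency by a mask value, then transform back.

    The mask is assumed real and symmetric (mask[k] == mask[n - k]) so that a
    real input gives a real output. Here it is fixed, not predicted from x.
    """
    spectrum = dft(x)
    filtered = [s * m for s, m in zip(spectrum, mask)]
    return [v.real for v in idft(filtered)]


def local_filter(x, kernel):
    """Image-domain step: circular convolution with a small, centered kernel."""
    n, radius = len(x), len(kernel) // 2
    return [sum(kernel[j] * x[(i + j - radius) % n] for j in range(len(kernel)))
            for i in range(n)]


def alternating_block(x, mask, kernel, stages=2):
    """Alternate the two steps: Fourier-domain filter, then image-domain filter."""
    for _ in range(stages):
        x = local_filter(fourier_filter(x, mask), kernel)
    return x


# A single impulse makes the reach of each step visible.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
low_pass_mask = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]  # keeps frequencies 0 and +/-1
smoothing_kernel = [0.25, 0.5, 0.25]

local_out = local_filter(impulse, smoothing_kernel)    # only the neighbours change
global_out = fourier_filter(impulse, low_pass_mask)    # every position changes
block_out = alternating_block(impulse, low_pass_mask, smoothing_kernel)

print("local :", [round(v, 3) for v in local_out])
print("global:", [round(v, 3) for v in global_out])
print("block :", [round(v, 3) for v in block_out])
```

In this toy setting the local step spreads the impulse to its immediate neighbours only, whereas the Fourier step changes every position in a single operation. A practical implementation would use a fast 2D FFT over feature maps and input-dependent filters instead of the fixed ones assumed here.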

AFIDAF: Alternating Fourier and Image Domain Adaptive Filters as an Efficient Alternative to Attention in ViTs

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

FViT: A Focal Vision Transformer with Gabor Filter

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias

DBFFT: Adversarial-robust dual-branch frequency domain feature fusion in vision transformers

ACC-ViT: Atrous Convolution's Comeback in Vision Transformers

Improving Vision Transformers by Revisiting High-Frequency Components

FasterViT: Fast Vision Transformers with Hierarchical Attention

FMViT: A multiple-frequency mixing Vision Transformer

FilterViT and DropoutViT: Lightweight Vision Transformer Models for Efficient Attention Mechanisms

A Free Lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer for Fine-grained Visual Recognition

Vicinity Vision Transformer

Fusion of regional and sparse attention in Vision Transformers

Fast Vision Transformers with HiLo Attention

LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

CF-ViT: A General Coarse-to-Fine Method for Vision Transformer

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

Vision Transformer with Attention Map Hallucination and FFN Compaction