AFIDAF: Alternating Fourier and Image Domain Adaptive Filters as an Efficient Alternative to Attention in ViTs

Yunling Zheng, Zeyi Xu, Fanghui Xue, Biao Yang, Jiancheng Lyu, Shuai Zhang, Yingyong Qi, Jack Xin
2024-09-26
Abstract: We propose and demonstrate an alternating Fourier and image domain filtering approach for feature extraction as an efficient alternative to build a vision backbone without using the computationally intensive attention. The performance among the lightweight models reaches the state-of-the-art level on ImageNet-1K classification, and improves downstream tasks on object detection and segmentation consistently as well. Our approach also serves as a new tool to compress vision transformers (ViTs).
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to improve the performance of lightweight vision models on computer vision tasks such as image classification, object detection, and semantic segmentation. Specifically, the authors propose an Alternating Fourier and Image-Domain Adaptive Filtering (AFIDAF) method that replaces the computationally intensive attention mechanism, yielding an efficient vision backbone network.

### Main problems:

1. **Reduce computational complexity**: Existing Vision Transformers (ViTs) outperform Convolutional Neural Networks (CNNs), but their computational cost is relatively high, especially in the attention mechanism. This makes them difficult to deploy in resource-constrained environments or on mobile devices.
2. **Improve the performance of lightweight models**: Existing lightweight models (such as MobileNets and ShuffleNets) have fewer parameters, but on some tasks they perform worse than large-scale models. Improving their accuracy while keeping them lightweight is therefore an important research direction.
3. **Address the limitations of AFFNet**: AFFNet (Adaptive Frequency Filters as Efficient Global Token Mixers) achieves global feature mixing through the Fourier transform. In its actual implementation, however, channel-dimension filtering limits its performance in the frequency domain, resulting in poor performance on high-resolution or dense prediction tasks.

### Solutions proposed in the paper:

- **Alternating Fourier and Image-Domain Adaptive Filtering (AFIDAF)**: By alternating adaptive filtering between the Fourier domain and the image domain, the method combines the advantages of large-kernel convolution and the Fourier transform. It can extract features locally and also mix features effectively at the global level, improving performance while keeping the model lightweight.
- **Hierarchical AFIDAF (HAFIDAF)**: To further compress ViT models, the authors propose a hierarchical AFIDAF framework based on Swin Transformer. It reduces the number of parameters while maintaining high performance on image classification, object detection, and semantic segmentation.

### Experimental results:

- **Image classification**: AFIDAF achieves a Top-1 accuracy of 80.9% on the ImageNet-1K dataset, which is better than other lightweight models, with only 6.5M parameters and 1.5G FLOPs.
- **Object detection**: On the MS-COCO dataset, the mAP of AFIDAF reaches 30.2%, which is significantly better than other lightweight detectors.
- **Semantic segmentation**: On the PASCAL VOC 2012 dataset, the mIoU of AFIDAF reaches 81.6%, which is also better than other lightweight models.

In conclusion, by introducing the AFIDAF method, the paper addresses the trade-off between accuracy and computational efficiency in lightweight vision models and achieves strong performance on multiple vision tasks.
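
To make the alternating filtering idea concrete, here is a minimal toy sketch in pure Python. It is not the authors' implementation: the function names (`dft2`, `fourier_filter`, `image_filter`, `alternating_block`) are hypothetical, the input is a single-channel 4x4 grid, and the frequency mask and spatial kernel are fixed, hand-picked placeholders, whereas the paper's filters are adaptive. The sketch only illustrates why a Fourier-domain filter mixes information globally (an elementwise product in the spectrum touches every pixel) while an image-domain filter mixes locally.

```python
import cmath


def dft2(x, inverse=False):
    """Naive 2D DFT of an H x W grid (list of lists). O((HW)^2); illustration only."""
    h, w = len(x), len(x[0])
    sign = 1 if inverse else -1
    out = [[0j] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for r in range(h):
                for c in range(w):
                    angle = sign * 2 * cmath.pi * (u * r / h + v * c / w)
                    acc += x[r][c] * cmath.exp(1j * angle)
            # The inverse transform carries the 1/(H*W) normalization.
            out[u][v] = acc / (h * w) if inverse else acc
    return out


def fourier_filter(x, mask):
    """Global mixing: multiply the spectrum elementwise by `mask`, transform back.

    Assumption: `mask` is a fixed placeholder; the real part is kept at the end.
    """
    h, w = len(x), len(x[0])
    spec = dft2(x)
    filtered = [[spec[u][v] * mask[u][v] for v in range(w)] for u in range(h)]
    return [[z.real for z in row] for row in dft2(filtered, inverse=True)]


def image_filter(x, kernel):
    """Local mixing: 'same'-size 2D cross-correlation with zero padding."""
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - kh // 2, c + j - kw // 2
                    if 0 <= rr < h and 0 <= cc < w:
                        out[r][c] += kernel[i][j] * x[rr][cc]
    return out


def alternating_block(x, mask, kernel):
    """One alternation: Fourier-domain filter followed by image-domain filter."""
    return image_filter(fourier_filter(x, mask), kernel)


# A single bright pixel in an otherwise empty 4x4 image.
x = [[0.0] * 4 for _ in range(4)]
x[1][2] = 16.0

# Keep only the zero-frequency (DC) component: the most extreme low-pass mask.
dc_only = [[1.0 if (u, v) == (0, 0) else 0.0 for v in range(4)] for u in range(4)]
# A 3x3 kernel that passes the center pixel through unchanged.
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]

y = alternating_block(x, dc_only, identity)
print([[round(v, 6) for v in row] for row in y])
```

With the DC-only mask, the single pixel of value 16 is spread evenly over all 16 positions, so every output value is approximately 1.0. A 3x3 image-domain kernel on its own could only move that information to neighboring pixels, which is the motivation for combining the two kinds of filter. A practical implementation would use a fast Fourier transform and multi-channel tensors rather than this naive quadruple loop.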