EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

Jiangning Zhang, Xiangtai Li, Yabiao Wang, Chengjie Wang, Yibo Yang, Yong Liu, Dacheng Tao
2024-08-11
Abstract: Motivated by biological evolution, this paper explains the rationality of Vision Transformer by analogy with the proven practical evolutionary algorithm (EA) and derives that both have a consistent mathematical formulation. Then, inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that only contains the proposed EA-based transformer (EAT) block, which consists of three residual parts, i.e., multi-scale region aggregation, global and local interaction, and feed-forward network modules, to model multi-scale, interactive, and individual information separately. Moreover, we design a task-related head docked with the transformer backbone to complete final information fusion more flexibly and improve a modulated deformable MSA to dynamically model irregular locations. Extensive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. E.g., our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; EATFormer-Tiny/Small/Base armed with Mask R-CNN obtain 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP respectively with fewer FLOPs; our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K with UperNet, exceeding Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.
Computer Vision and Pattern Recognition, Emerging Technologies
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper attempts to explain the rationale behind Vision Transformers (ViT) through an analogy with Evolutionary Algorithms (EA) and proposes a new improved model based on this analogy: EATFormer.

#### Main Objectives:

1. **Explain the effectiveness of ViT**: Explain why Vision Transformers are effective by drawing an analogy with proven effective Evolutionary Algorithms.
2. **Propose a new architecture**: Inspired by variants of Evolutionary Algorithms, design a novel pyramid-structured Vision Transformer model, EATFormer.
3. **Enhance performance**: Demonstrate superior performance over existing State-Of-The-Art (SOTA) methods in tasks such as image classification, object detection, and semantic segmentation.

#### Specific Contributions:

1. **Theoretical Contribution**: Enrich the theoretical understanding of the rationale behind Vision Transformers through evolutionary explanations and derive mathematical formulas consistent with Evolutionary Algorithms.
2. **Framework Contribution**: Propose a new evolutionary-based transformer block (EAT block), which includes three residual parts for modeling multi-scale information, feature interaction, and individual enhancement.
3. **Method Contribution**: Design four modules to improve the effectiveness and usability of EATFormer: Global-Local Interaction module (GLI), Multi-Scale Region Aggregation module (MSRA), Task-Related Head module (TRH), and Modulated Deformable Multi-Head Self-Attention module (MD-MSA).
4. **Experimental Validation**: Extensive experiments demonstrate the superiority and efficiency of EATFormer in various vision tasks, and ablation studies further validate the effectiveness of its components.
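
The EAT block described above chains three residual parts: multi-scale region aggregation, global and local interaction, and a feed-forward network. The following is a minimal pure-Python sketch of that residual structure only. The operators inside `msra`, `gli`, and `ffn` (window averaging, a global mean, a scaled ReLU) are illustrative placeholders assumed here, not the paper's actual modules, and all function names are hypothetical; the authors' implementation is in the linked repository.

```python
from typing import Callable, List

Tokens = List[List[float]]  # N tokens, each a list of C feature values


def window_mean(x: Tokens, radius: int) -> Tokens:
    """Average each token with its neighbours within `radius` positions."""
    n = len(x)
    out = []
    for i in range(n):
        group = x[max(0, i - radius):min(n, i + radius + 1)]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


def msra(x: Tokens) -> Tokens:
    """Placeholder for multi-scale region aggregation: mix two window sizes."""
    small, large = window_mean(x, 1), window_mean(x, 2)
    return [[(a + b) / 2 for a, b in zip(s, lg)] for s, lg in zip(small, large)]


def gli(x: Tokens) -> Tokens:
    """Placeholder for global and local interaction: global mean plus local window."""
    global_mean = [sum(col) / len(x) for col in zip(*x)]
    local = window_mean(x, 1)
    return [[(g + v) / 2 for g, v in zip(global_mean, row)] for row in local]


def ffn(x: Tokens) -> Tokens:
    """Placeholder for the feed-forward network: acts on each token independently."""
    return [[max(0.0, 0.5 * v) for v in row] for row in x]


def residual(x: Tokens, f: Callable[[Tokens], Tokens]) -> Tokens:
    """Return x + f(x), element-wise."""
    return [[a + b for a, b in zip(row, frow)] for row, frow in zip(x, f(x))]


def eat_block(x: Tokens) -> Tokens:
    """Apply the three residual parts in the order the summary lists them."""
    for part in (msra, gli, ffn):
        x = residual(x, part)
    return x


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 3.0]]
out = eat_block(tokens)
print(len(out), len(out[0]))  # token count and channel count are preserved
```

Because every part is wrapped in a residual connection, the block keeps the token and channel dimensions unchanged, which is what lets such blocks be stacked within each stage of a pyramid backbone.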