EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

Jiangning Zhang, Xiangtai Li, Yabiao Wang, Chengjie Wang, Yibo Yang, Yong Liu, Dacheng Tao

2024-08-11

Abstract: Motivated by biological evolution, this paper explains the rationality of Vision Transformer by analogy with the proven practical evolutionary algorithm (EA) and derives that both have a consistent mathematical formulation. Then, inspired by effective EA variants, we propose a novel pyramid EATFormer backbone that only contains the proposed EA-based transformer (EAT) block, which consists of three residual parts, i.e., multi-scale region aggregation, global and local interaction, and feed-forward network modules, to model multi-scale, interactive, and individual information separately. Moreover, we design a task-related head docked with the transformer backbone to complete final information fusion more flexibly and improve a modulated deformable MSA to dynamically model irregular locations. Massive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory experiments demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. E.g., our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; Mask R-CNN armed with EATFormer-Tiny/Small/Base obtains 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP respectively with fewer FLOPs; our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K with UperNet, exceeding Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.
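The margins quoted in the abstract can be turned back into the baseline scores they imply. The short sketch below only subtracts each reported margin from the matching EATFormer score; the resulting baseline values are derived from the abstract's own numbers, not copied from the MPViT or Swin papers, and the table and function names are illustrative.

```python
# Each row: (EATFormer variant, baseline named in the abstract,
#            reported EATFormer score, reported margin over the baseline).
BOX_AP = [
    ("EATFormer-Tiny", "MPViT-T", 45.4, 0.6),
    ("EATFormer-Small", "Swin-T", 47.4, 1.4),
    ("EATFormer-Base", "Swin-S", 49.0, 0.5),
]
MASK_AP = [
    ("EATFormer-Tiny", "MPViT-T", 41.4, 0.4),
    ("EATFormer-Small", "Swin-T", 42.9, 1.3),
    ("EATFormer-Base", "Swin-S", 44.2, 0.9),
]
ADE20K_MIOU = [
    ("EATFormer-Small", "Swin-T", 47.3, 2.8),
    ("EATFormer-Base", "Swin-S", 49.3, 1.7),
]


def implied_baselines(rows):
    """Map each baseline name to (EATFormer score - margin), rounded to 0.1."""
    return {baseline: round(score - margin, 1)
            for _, baseline, score, margin in rows}


for metric, rows in [("box AP", BOX_AP),
                     ("mask AP", MASK_AP),
                     ("mIoU", ADE20K_MIOU)]:
    print(metric, implied_baselines(rows))
```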

Computer Vision and Pattern Recognition, Emerging Technologies

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper attempts to explain the rationale behind Vision Transformers (ViT) through an analogy with Evolutionary Algorithms (EA) and proposes a new, improved model based on this analogy: EATFormer.

#### Main Objectives:

1. **Explain the effectiveness of ViT**: Explain why Vision Transformers are effective by drawing an analogy with Evolutionary Algorithms, which are already proven effective in practice.
2. **Propose a new architecture**: Inspired by variants of Evolutionary Algorithms, design a novel pyramid-structured Vision Transformer model, EATFormer.
3. **Enhance performance**: Demonstrate superior performance over existing state-of-the-art (SOTA) methods in tasks such as image classification, object detection, and semantic segmentation.

#### Specific Contributions:

1. **Theoretical Contribution**: Enrich the theoretical understanding of the rationale behind Vision Transformers through evolutionary explanations and derive mathematical formulas consistent with Evolutionary Algorithms.
2. **Framework Contribution**: Propose a new evolutionary-based transformer block (EAT block), which includes three residual parts for modeling multi-scale information, feature interaction, and individual enhancement.
3. **Method Contribution**: Design four modules to improve the effectiveness and usability of EATFormer: Global-Local Interaction module (GLI), Multi-Scale Region Aggregation module (MSRA), Task-Related Head module (TRH), and Modulated Deformable Multi-Head Self-Attention module (MD-MSA).
4. **Experimental Validation**: Extensive experiments demonstrate the superiority and efficiency of EATFormer in various vision tasks, and ablation studies further validate the effectiveness of its components.
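The "three residual parts" of the EAT block can be pictured as three skip connections applied one after another. The following is a minimal structural sketch, not the authors' implementation: it assumes the branches run in the order the abstract lists them (MSRA, then GLI, then FFN), omits normalization layers, and uses toy stand-in branches. The names `residual`, `eat_block`, `zero_branch`, and `token_mean_branch` are hypothetical.

```python
from typing import Callable, List

Tokens = List[List[float]]           # one feature vector per spatial token
Branch = Callable[[Tokens], Tokens]  # any shape-preserving sub-module


def residual(x: Tokens, branch: Branch) -> Tokens:
    """Return x + branch(x), added element-wise (a skip connection)."""
    y = branch(x)
    return [[a + b for a, b in zip(x_row, y_row)] for x_row, y_row in zip(x, y)]


def eat_block(x: Tokens, msra: Branch, gli: Branch, ffn: Branch) -> Tokens:
    """Apply the three residual parts in sequence (order is an assumption)."""
    x = residual(x, msra)  # multi-scale region aggregation
    x = residual(x, gli)   # global and local interaction
    x = residual(x, ffn)   # per-token feed-forward network
    return x


def zero_branch(x: Tokens) -> Tokens:
    """Toy stand-in that contributes nothing, so the skip path dominates."""
    return [[0.0] * len(row) for row in x]


def token_mean_branch(x: Tokens) -> Tokens:
    """Toy stand-in for global interaction: every token receives the
    channel-wise mean over all tokens."""
    n = len(x)
    mean = [sum(col) / n for col in zip(*x)]
    return [list(mean) for _ in x]


tokens = [[1.0, 2.0], [3.0, 4.0]]
out = eat_block(tokens, zero_branch, token_mean_branch, zero_branch)
print(out)
```

Because each part is residual, swapping any branch for `zero_branch` leaves the input unchanged along that path, which is one way to see why the three parts can model different kinds of information independently.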

Improved EATFormer: A Vision Transformer for Medical Image Classification

Analogous to Evolutionary Algorithm: Designing a Unified Sequence Model

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization

EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention

DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition

big.LITTLE Vision Transformer for Efficient Visual Recognition

AutoFormer: Searching Transformers for Visual Recognition

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

ACC-ViT: Atrous Convolution's Comeback in Vision Transformers

ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond

EVA-02: A Visual Representation for Neon Genesis

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

A novel dual-granularity lightweight transformer for vision tasks

ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding

Lite Vision Transformer with Enhanced Self-Attention

Vision Transformer with Sparse Scan Prior