Abstract: Motivated by biological evolution, this paper explains the rationality of the Vision Transformer by analogy with the proven, practical evolutionary algorithm (EA) and shows that the two share a consistent mathematical formulation. Inspired by effective EA variants, we then propose a novel pyramid EATFormer backbone that contains only the proposed EA-based transformer (EAT) block. The block consists of three residual parts, i.e., multi-scale region aggregation, global and local interaction, and feed-forward network modules, which model multi-scale, interactive, and individual information, respectively. Moreover, we design a task-related head docked with the transformer backbone to complete the final information fusion more flexibly, and we improve a modulated deformable MSA to dynamically model irregular locations. Extensive quantitative and qualitative experiments on image classification and downstream tasks, together with explanatory experiments, demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. For example, our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; Mask R-CNN equipped with EATFormer-Tiny/Small/Base obtains 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing the contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP, respectively, with fewer FLOPs; and our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K with UperNet, exceeding Swin-T/S by 2.8/1.7 mIoU. Code is available at https://github.com/zhangzjn/EATFormer.

Illumination Adaptive Transformer.

You Only Need 90K Parameters to Adapt Light: A Light Weight Transformer for Image Enhancement and Exposure Correction

IATN: illumination-aware two-stage network for low-light image enhancement

A Transformer-Based Network for Low-Light Image Enhancement

IGT: Illumination-guided RGB-T object detection with transformers

Lite Vision Transformer with Enhanced Self-Attention

Illumination-Aware Low-Light Image Enhancement with Transformer and Auto-Knee Curve

Pre‐trained low‐light image enhancement transformer

Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

A non-uniform low-light image enhancement method with multi-scale attention transformer and luminance consistency loss

IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes

Illumination-aware window transformer for rgbt modality fusion

Implicit Multi-Spectral Transformer: An Lightweight and Effective Visible to Infrared Image Translation Model

Vision Transformer with Sparse Scan Prior

Image attention transformer network for indoor 3D object detection

EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method

Self-Aligned Concave Curve: Illumination Enhancement for Unsupervised Adaptation

IPT-V2: Efficient Image Processing Transformer using Hierarchical Attentions