Abstract: The self-attention mechanism has been a key factor in the recent progress of Vision Transformer (ViT), as it enables adaptive feature extraction from global contexts. However, existing self-attention methods adopt either sparse global attention or window attention to reduce computational complexity, which may compromise local feature learning or depend on handcrafted designs. In contrast, local attention, which restricts the receptive field of each query to its own neighboring pixels, enjoys the benefits of both convolution and self-attention, namely local inductive bias and dynamic feature selection. Nevertheless, current local attention modules either use the inefficient Im2Col function or rely on specific CUDA kernels that are hard to generalize to devices without CUDA support. In this paper, we propose a novel local attention module, Slide Attention, which leverages common convolution operations to achieve high efficiency, flexibility and generalizability. Specifically, we first re-interpret the column-based Im2Col function from a new row-based perspective and use Depthwise Convolution as an efficient substitution. On this basis, we propose a deformed shifting module based on the re-parameterization technique, which further relaxes the fixed key/value positions to deformed features in the local region. In this way, our module realizes the local attention paradigm in a manner that is both efficient and flexible. Extensive experiments show that our slide attention module is applicable to a variety of advanced Vision Transformer models, is compatible with various hardware devices, and achieves consistently improved performance on comprehensive benchmarks. Code is available at https://github.com/LeapLabTHU/Slide-Transformer.
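The row-based reading of Im2Col described in the abstract can be illustrated with a small sketch. It assumes a single channel, a 3x3 window, zero padding, and fixed one-hot kernels as the way a depthwise convolution produces shifted feature maps; the function names (`im2col`, `conv2d_single`, `shift_rows`) are hypothetical and this is not the authors' implementation. The sketch only checks that stacking the shifted maps gives the transpose of the column-wise Im2Col output.

```python
# Illustrative sketch only: names and the one-hot kernel choice are assumptions.

def im2col(x, k=3):
    """Column view: for every pixel, gather its k*k zero-padded neighbours."""
    h, w = len(x), len(x[0])
    r = k // 2
    cols = []
    for i in range(h):
        for j in range(w):
            patch = []
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    inside = 0 <= ii < h and 0 <= jj < w
                    patch.append(x[ii][jj] if inside else 0.0)
            cols.append(patch)
    return cols  # h*w entries, each of length k*k


def conv2d_single(x, kernel):
    """Zero-padded 2D cross-correlation on one channel (one depthwise filter)."""
    h, w = len(x), len(x[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(k):
                for b in range(k):
                    ii, jj = i + a - r, j + b - r
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[a][b] * x[ii][jj]
            out[i][j] = acc
    return out


def shift_rows(x, k=3):
    """Row view: k*k shifted copies of the map, one per one-hot kernel."""
    rows = []
    for a in range(k):
        for b in range(k):
            one_hot = [[1.0 if (p, q) == (a, b) else 0.0 for q in range(k)]
                       for p in range(k)]
            shifted = conv2d_single(x, one_hot)
            rows.append([v for row in shifted for v in row])
    return rows  # k*k entries, each of length h*w


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]

cols = im2col(x)
rows = shift_rows(x)

# Transposing the row view recovers the column view exactly.
transposed = [list(c) for c in zip(*rows)]
matches = transposed == cols
```

In this sketch, each one-hot kernel selects a single neighbour offset, so every row is the whole feature map shifted in one direction. With learnable kernels in place of the one-hot ones, the sampled positions are no longer fixed, which is loosely the direction the deformed shifting module in the abstract takes.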

GLaLT: Global-Local Attention-Augmented Light Transformer for Scene Text Recognition

Scene Chinese Recognition with Local and Global Attention

MASTER: Multi-Aspect Non-local Network for Scene Text Recognition

Flexible scene text recognition based on dual attention mechanism

Heterogeneous Attention Based Transformer for Sign Language Translation

ViT-LSLA: Vision Transformer with Light Self-Limited-Attention

FACLSTM: ConvLSTM with focused attention for scene text recognition

Local-to-Global Self-Attention in Vision Transformers

Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition

LGAFormer: transformer with local and global attention for action detection

Lite Vision Transformer with Enhanced Self-Attention

Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers

TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance

ViTSTR-Transducer: Cross-Attention-Free Vision Transformer Transducer for Scene Text Recognition

DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection

PLG-ViT: Vision Transformer with Parallel Local and Global Self-Attention

Scene Text Recognition with Cascade Attention Network

Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention

Attention-Guided Spatial Transformer Networks for Fine-Grained Visual Recognition

Pure Transformer with Integrated Experts for Scene Text Recognition

A Multiscale Grouping Transformer With CLIP Latents for Remote Sensing Image Captioning