Abstract: Because global self-attention captures long-distance dependencies well, Transformer-based methods have achieved remarkable performance in many vision tasks, including single-image super-resolution (SISR). However, images exhibit strong local self-similarity, so applying global self-attention spends much of the computation on weakly correlated parts of the image. The problem is most severe for large, high-resolution images, where global self-attention performs a large number of redundant calculations. To address this, we propose the Enhanced Local Multi-windows Attention Network (ELMA), which contains two main designs. First, unlike traditional self-attention based on square window partitioning, we propose Multi-windows Self-Attention (M-WSA), which uses a new window partitioning mechanism to capture different types of local long-distance dependencies. Analysis and experiments show that, compared with the self-attention mechanisms commonly used in other SR networks, M-WSA reduces computational complexity while achieving superior performance. Second, we propose a Spatial Gated Network (SGN) as the feed-forward network. SGN mitigates the intermediate channel redundancy of the traditional MLP, thereby improving the parameter utilization and computational efficiency of the network. SGN also introduces spatial information into the feed-forward network, which a traditional MLP cannot obtain; this lets the network make better use of the spatial structure of the image and improves its performance and generalization ability. Extensive experiments show that ELMA achieves performance competitive with state-of-the-art methods while using fewer parameters and less computation.
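The complexity argument in the abstract can be illustrated with a rough operation count. The sketch below is not the M-WSA partition itself, whose window shapes the abstract does not specify. It only compares the standard cost of the two attention matrix products (QK^T and attention times V) for global attention against the same products restricted to non-overlapping windows. The function names, the 64x64 feature map, the 60 channels, and the window sizes are all illustrative assumptions.

```python
def global_attention_macs(h, w, c):
    """Multiply-accumulates for QK^T and attention*V over all h*w tokens.

    Linear projections are excluded; only the two token-mixing products count.
    """
    n = h * w
    return 2 * n * n * c


def window_attention_macs(h, w, c, win_h, win_w):
    """Same two products, restricted to non-overlapping win_h x win_w windows.

    The window shape is a hypothetical parameter, not the paper's partition.
    """
    if h % win_h or w % win_w:
        raise ValueError("feature map must be divisible by the window size")
    tokens_per_window = win_h * win_w
    num_windows = (h // win_h) * (w // win_w)
    return num_windows * 2 * tokens_per_window ** 2 * c


if __name__ == "__main__":
    h, w, c = 64, 64, 60  # assumed feature-map size and channel count
    full = global_attention_macs(h, w, c)
    for win in [(8, 8), (4, 16), (16, 4)]:  # square and rectangular windows
        local = window_attention_macs(h, w, c, *win)
        print(f"window {win}: {local:,} MACs, {full / local:.0f}x fewer than global")
```

Under these assumptions the cost of windowed attention grows linearly with the number of tokens rather than quadratically, and windows of different shapes with the same area cost the same, which is why varying the window shape can change the captured dependencies without changing this part of the budget.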

LMSA: Low-Relation Multi-head Self-attention Mechanism in Visual Transformer

ViT-LSLA: Vision Transformer with Light Self-Limited-Attention

Lite Vision Transformer with Enhanced Self-Attention

Constituent Attention for Vision Transformers

ELSA: Enhanced Local Self-Attention for Vision Transformer

Improving Vision Transformers by Overlapping Heads in Multi-Head Self-Attention

Adapting LLaMA Decoder to Vision Transformer

Vision Transformers with Hierarchical Attention

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

Local-to-Global Self-Attention in Vision Transformers

MAFormer: A transformer network with multi-scale attention fusion for visual recognition

What Limits the Performance of Local Self-attention?

Transformer with sparse self‐attention mechanism for image captioning

Efficient Visual Transformer by Learnable Token Merging

Low-Resolution Self-Attention for Semantic Segmentation

Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets

Conv-Attention: A Low Computation Attention Calculation Method for Swin Transformer

Enhanced local multi-windows attention network for lightweight image super-resolution

Transformer-BLS: An efficient learning algorithm based on multi-head attention mechanism and incremental learning algorithms

FAM: Improving columnar vision transformer with feature attention mechanism