Abstract: Vision transformers have shown great success on numerous computer vision tasks. However, their central component, softmax attention, prevents vision transformers from scaling to high-resolution images because both its computational complexity and its memory footprint are quadratic in the sequence length. Linear attention, which reorders the self-attention computation, was introduced in natural language processing (NLP) to mitigate a similar issue, but directly applying existing linear attention to vision may not lead to satisfactory results. We investigate this problem and point out that existing linear attention methods ignore an inductive bias in vision tasks, i.e., 2D locality. In this paper, we propose Vicinity Attention, a type of linear attention that integrates 2D locality. Specifically, for each image patch, we adjust its attention weight based on its 2D Manhattan distance from its neighbouring patches. In this way, we achieve 2D locality with linear complexity: neighbouring image patches receive stronger attention than faraway patches. In addition, we propose a novel Vicinity Attention Block, composed of Feature Reduction Attention (FRA) and Feature Preserving Connection (FPC), to address the computational bottleneck of linear attention approaches, including our Vicinity Attention, whose complexity grows quadratically with respect to the feature dimension. The Vicinity Attention Block computes attention in a compressed feature space with an extra skip connection to retrieve the original feature distribution. We experimentally validate that the block further reduces computation without degrading accuracy. Finally, to validate the proposed methods, we build a linear vision transformer backbone named Vicinity Vision Transformer (VVT). Targeting general vision tasks, we build VVT in a pyramid structure with progressively reduced sequence length.
We perform extensive experiments on CIFAR-100, ImageNet-1k, and ADE20K datasets to validate the effectiveness of our method. Our method has a slower growth rate in terms of computational overhead than previous transformer-based and convolution-based networks when the input resolution increases. In particular, our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous approaches.
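
The two ingredients named in the abstract, reordered (linear) attention and a 2D Manhattan distance between image patches, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the `1 + elu` feature map, the function names, and the toy sizes are assumptions chosen for illustration, and the sketch does not reproduce how Vicinity Attention folds the distance weighting into the linear form. It only shows that reordering the matrix products gives the same output while avoiding the N x N score matrix, and what the pairwise patch distances look like.

```python
import numpy as np


def manhattan_distance_grid(height, width):
    """Pairwise 2D Manhattan distance between patches of a height x width grid."""
    rows, cols = np.divmod(np.arange(height * width), width)
    return np.abs(rows[:, None] - rows[None, :]) + np.abs(cols[:, None] - cols[None, :])


def feature_map(x):
    """Strictly positive feature map, 1 + elu(x) (an assumption, not the paper's choice)."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def quadratic_attention(q, k, v):
    """Kernel attention in the quadratic order: (phi(Q) phi(K)^T) V."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = phi_q @ phi_k.T                      # (N, N): cost grows with N^2
    return (scores @ v) / scores.sum(axis=1, keepdims=True)


def linear_attention(q, k, v):
    """Same result in the reordered form: phi(Q) (phi(K)^T V)."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v                              # (d, d): cost grows with N * d^2
    normalizer = phi_q @ phi_k.sum(axis=0)        # (N,): row sums of the score matrix
    return (phi_q @ kv) / normalizer[:, None]


# Toy example: a 4 x 4 patch grid (N = 16 patches) with feature dimension d = 8.
rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 16, 8))
same = np.allclose(quadratic_attention(q, k, v), linear_attention(q, k, v))
dist = manhattan_distance_grid(4, 4)
```

In the reordered form the intermediate `kv` matrix is d x d, which is why the cost of linear attention grows quadratically with the feature dimension, the bottleneck that the abstract says the Vicinity Attention Block targets by computing attention in a compressed feature space.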

Linearly-evolved Transformer for Pan-sharpening

SSETPAN: Spatial-Spectral Enhanced Transformer Based Network for Pansharpening

PanFormer: a Transformer Based Model for Pan-sharpening

Effective Pan-Sharpening with Transformer and Invertible Neural Network

Pan-Sharpening with Customized Transformer and Invertible Neural Network

FLatten Transformer: Vision Transformer using Focused Linear Attention

STCP: Synergistic Transformer and Convolutional Neural Network for Pansharpening

Local-Global Transformer Enhanced Unfolding Network for Pan-sharpening

Vicinity Vision Transformer

Transformer-based dual path cross fusion for pansharpening remote sensing images

Transformer-Based Dual-Branch Multiscale Fusion Network for Pan-Sharpening Remote Sensing Images

FAM: Improving columnar vision transformer with feature attention mechanism

A novel pansharpening method based on cross stage partial network and transformer

Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

An efficient multi‐scale transformer for satellite image dehazing

DPP: Scale-Generalization Transformer Based on Dynamic Projection for Pansharpening

Dynamic Grained Encoder for Vision Transformers

Transformer-based adaptive 3D residual CNN with sparse representation for PAN-sharpening of multispectral images

CMT: Cross Modulation Transformer with Hybrid Loss for Pansharpening

VisionTwinNet: Gated Clarity Enhancement Paired With Light-Robust CD Transformers