Abstract: Object detection is a fundamental task in remote sensing image analysis and scene understanding. Previous remote sensing object detectors are typically based on convolutional neural networks (CNNs), whose performance is significantly limited by the intrinsic locality of convolution operations. Vision Transformers offer a potential solution to this problem and could serve as a solid alternative to CNNs. However, three crucial obstacles hinder the application and performance of Transformers in remote sensing object detection, namely: 1) high computational complexity, especially for high-resolution remote sensing images; 2) training and sample inefficiency caused by a lack of inductive bias; and 3) difficulty in learning the arbitrary orientations of geospatial objects. To address these issues, this article proposes a novel efficient inductive vision Transformer framework for oriented object detection in remote sensing imagery. The framework follows the hierarchical feature pyramid structure and makes three contributions: 1) spatial redundancy in remote sensing images is fully explored, and an adaptive multigrained routing mechanism is proposed to facilitate token sparsity in Transformers, which can dramatically reduce the computational cost without compromising accuracy; 2) a compact dual-path encoding architecture, in which global long-range dependencies and local semantic relations are captured jointly and complementarily, is proposed to enhance inductive bias in Transformers; and 3) an angle tokenization technique is proposed to promote the encoding, embedding, and learning of direction knowledge for oriented objects in remote sensing scenarios. These three contributions are instantiated in an advanced Transformer-based object detector, namely, EIA-pyramid vision Transformer (PVT). Comprehensive experiments on two publicly available datasets demonstrate its effectiveness and superiority for oriented object detection in remote sensing images.
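
To make the token-sparsity argument concrete, the sketch below shows why dropping redundant tokens reduces attention cost: the pairwise term of self-attention grows with the square of the token count, so keeping half the tokens leaves roughly a quarter of that cost. This is only an illustration under stated assumptions, not the routing mechanism described in the abstract. The function names, the top-k selection rule, the keep ratio, and the scores are all hypothetical.

```python
import math


def attention_pair_cost(num_tokens: int, dim: int) -> int:
    """Approximate multiply-adds for the two n x n products in self-attention.

    Counts Q @ K^T and the attention-weighted sum over V, each n * n * dim.
    Projections and softmax are ignored, so this is a rough estimate only.
    """
    return 2 * num_tokens * num_tokens * dim


def keep_top_tokens(scores, keep_ratio):
    """Return the indices of the highest-scoring tokens, in original order.

    Hypothetical selection rule: a plain top-k over per-token scores.
    """
    num_keep = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:num_keep])


# Toy example: 8 tokens with made-up informativeness scores.
scores = [0.9, 0.1, 0.05, 0.8, 0.2, 0.7, 0.02, 0.6]
kept = keep_top_tokens(scores, keep_ratio=0.5)

dense = attention_pair_cost(len(scores), dim=64)
sparse = attention_pair_cost(len(kept), dim=64)

print("kept token indices:", kept)
print("dense cost:", dense, "sparse cost:", sparse)
print("sparse / dense:", sparse / dense)
```

In this toy setting, background tokens with low scores are simply discarded. An adaptive scheme such as the one the abstract mentions would presumably decide the granularity per region rather than use a single fixed ratio.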

Progressive Learning Vision Transformer for Open Set Recognition of Fine-Grained Objects in Remote Sensing Images

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

A Vision Transformer Architecture for Open Set Recognition

Open Set Recognition using Vision Transformer with an Additional Detection Head

Rethinking Transformers for Semantic Segmentation of Remote Sensing Images

Pattern Integration and Enhancement Vision Transformer for Self-Supervised Learning in Remote Sensing

Reperceive Global Vision of Transformer for Remote Sensing Images Weakly Supervised Object Localization

Vision Transformer with Progressive Sampling

Advancing Plain Vision Transformer Toward Remote Sensing Foundation Model

PCViT: A Pyramid Convolutional Vision Transformer Detector for Object Detection in Remote-Sensing Imagery

Remote Sensing Scene Classification Based on Local Selection Vision Transformer

Efficient Inductive Vision Transformer for Oriented Object Detection in Remote Sensing Imagery

ZoomViT: an Observation Behavior-Based Fine-Grained Recognition Scheme

ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator

Vision Transformer With Contrastive Learning for Remote Sensing Image Scene Classification

Adaptive Spatial Tokenization Transformer for Salient Object Detection in Optical Remote Sensing Images

A Vision Transformer for Fine-Grained Classification by Reducing Noise and Enhancing Discriminative Information

OSR-ViT: A Simple and Modular Framework for Open-Set Object Detection and Discovery

Part-Guided Relational Transformers for Fine-Grained Visual Recognition