Abstract: Object detection in remote-sensing images (RSIs) has long been an active research topic in the remote-sensing community. Recently, deep convolutional neural network (CNN)-based methods, including region-CNN-based and You-Only-Look-Once-based methods, have become the de facto standard for RSI object detection. CNNs are good at local feature extraction, but they are limited in capturing global features. In contrast, the attention-based Transformer can model long-range relationships within an RSI. Therefore, the Transformer for Remote-Sensing Object Detection (TRD) is investigated in this study. Specifically, the proposed TRD combines a CNN with a multi-layer Transformer consisting of encoders and decoders. To detect objects in RSIs, a modified Transformer is designed to aggregate features from global spatial positions at multiple scales and to model the interactions between pairwise instances. Then, because the source data set (e.g., ImageNet) and the target data set (i.e., the RSI data set) are quite different, the TRD with a transferring CNN (T-TRD), based on the attention mechanism, is proposed to adapt the pre-trained model for better RSI object detection. Because training a Transformer requires abundant, well-annotated samples, and the number of training samples for RSI object detection is usually limited, data augmentation is combined with the Transformer to avoid overfitting and improve detection performance on RSIs. The proposed T-TRD with data augmentation (T-TRD-DA) is tested on two widely used data sets (i.e., NWPU VHR-10 and DIOR). The experimental results show that the proposed models are competitive with the benchmark methods: they achieve a centuple mean average precision (mean average precision multiplied by 100) of 87.9 on NWPU VHR-10 and 66.8 on DIOR, up to 5.9 and 2.4 higher, respectively, than the comparison methods. These results suggest that the Transformer-based method opens a new window for RSI object detection.
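To make the reported metric concrete, the following minimal Python sketch shows one common way to compute a per-class average precision (AP) from ranked detections and then scale the class mean by 100 to obtain a "centuple" mAP. The function names, the all-point interpolation of the precision-recall curve, and the toy inputs are assumptions made for illustration; the abstract does not state which AP protocol the authors used, and the values below are not the paper's results.

```python
def average_precision(scored_hits, num_gt):
    """AP for one class (illustrative, all-point interpolation assumed).

    scored_hits: list of (confidence, is_true_positive) pairs, one per detection.
    num_gt: number of ground-truth objects of this class.
    """
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    true_positives = 0
    precisions, recalls = [], []
    for rank, (_, hit) in enumerate(ranked, start=1):
        if hit:
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_gt)

    # Replace each precision by the maximum precision at any higher recall,
    # which yields a monotonically non-increasing precision envelope.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the area under the envelope over recall increments.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def centuple_map(per_class_ap):
    """Mean of the per-class AP values, multiplied by 100."""
    return 100.0 * sum(per_class_ap) / len(per_class_ap)


# Hypothetical toy detections for a single class with two ground-truth objects.
toy_ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
print(round(toy_ap, 4))
print(centuple_map([1.0, 0.5]))
```

On this scale, a difference of 5.9 between two methods corresponds to a difference of 0.059 in unscaled mAP.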

SRDD: a lightweight end-to-end object detection with transformer

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

DFS-DETR: Detailed-Feature-Sensitive Detector for Small Object Detection in Aerial Images Using Transformer

ViDT: An Efficient and Effective Fully Transformer-based Object Detector

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

An Extendable, Efficient and Effective Transformer-based Object Detector

TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers

MCG-RTDETR: Multi-Convolution and Context-Guided Network with Cascaded Group Attention for Object Detection in Unmanned Aerial Vehicle Imagery

Remote Sensing Object Detection Based on Strong Feature Extraction and Prescreening Network

Efficient Decoder-Free Object Detection with Transformers

AF-DETR: efficient UAV small object detector via Assemble-and-Fusion mechanism

L-DETR: A Light-Weight Detector for End-to-End Object Detection With Transformers

DV-DETR: Improved UAV Aerial Small Target Detection Algorithm Based on RT-DETR

Transformer with Transfer CNN for Remote-Sensing-Image Object Detection

Aerial Image Object Detection With Vision Transformer Detector (ViTDet)

DS-Trans: A 3D Object Detection Method Based on a Deformable Spatiotemporal Transformer for Autonomous Vehicles

Drone-DETR: Efficient Small Object Detection for Remote Sensing Image Using Enhanced RT-DETR Model

Deformable DETR: Deformable Transformers for End-to-End Object Detection

SpeedDETR: Speed-aware Transformers for End-to-end Object Detection

AODet: Aerial Object Detection Using Transformers for Foreground Regions

SSD-MonoDETR: Supervised Scale-aware Deformable Transformer for Monocular 3D Object Detection