Abstract: Oriented object detection in remote sensing images is a challenging task because objects appear in many orientations. Recently, end-to-end transformer-based methods have achieved success by eliminating the post-processing operators that traditional CNN-based methods require. However, directly extending transformers to oriented object detection presents three main issues: 1) objects rotate arbitrarily, so angles must be encoded along with position and size; 2) self-attention lacks the geometric relations of oriented objects, because content and positional queries do not interact; and 3) oriented objects cause misalignment, mainly between values and positional queries in cross-attention, which makes accurate classification and localization difficult. In this paper, we propose an end-to-end transformer-based oriented object detector consisting of three dedicated modules that address these issues. First, Gaussian positional encoding is proposed to encode the angle, position, and size of oriented boxes using Gaussian distributions. Second, Wasserstein self-attention is proposed to introduce geometric relations and facilitate interaction between content and positional queries by utilizing Gaussian Wasserstein distance scores. Third, oriented cross-attention is proposed to align values and positional queries by rotating sampling points around the positional query according to their angles. Experiments on six datasets (DIOR-R, the DOTA series, HRSC2016, and ICDAR2015) show the effectiveness of our approach. Compared with previous end-to-end detectors, OrientedFormer gains 1.16 and 1.21 AP$_{50}$ on DIOR-R and DOTA-v1.0 respectively, while reducing training epochs from 3$\times$ to 1$\times$. The code is available at https://github.com/wokaikaixinxin/OrientedFormer.
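The first two modules both rest on treating an oriented box as a 2D Gaussian. The sketch below is illustrative and not taken from the paper's code: it assumes the commonly used convention $\mu = (c_x, c_y)$, $\Sigma = R(\theta)\,\mathrm{diag}(w^2/4,\, h^2/4)\,R(\theta)^\top$, and computes the squared 2-Wasserstein distance between two such Gaussians with the closed form that holds for 2x2 covariances. The function names `box_to_gaussian` and `wasserstein_sq` are hypothetical, and how the paper turns this distance into a positional encoding or an attention score is not specified in this text, so that step is left out.

```python
import math

def box_to_gaussian(cx, cy, w, h, theta):
    """Map an oriented box (center, size, angle in radians) to (mean, covariance).

    The covariance is returned as (sxx, sxy, syy), the entries of the
    symmetric matrix R diag(w^2/4, h^2/4) R^T. This convention is an assumption.
    """
    c, s = math.cos(theta), math.sin(theta)
    a, b = (w / 2) ** 2, (h / 2) ** 2  # variances along the box axes
    sxx = a * c * c + b * s * s
    sxy = (a - b) * c * s
    syy = a * s * s + b * c * c
    return (cx, cy), (sxx, sxy, syy)

def wasserstein_sq(g1, g2):
    """Squared 2-Wasserstein distance between two 2D Gaussians.

    Uses ||mu1 - mu2||^2 + tr(S1) + tr(S2) - 2 tr((S1^{1/2} S2 S1^{1/2})^{1/2}),
    where for 2x2 matrices the last trace equals
    sqrt(tr(S1 S2) + 2 sqrt(det(S1) det(S2))).
    """
    (mx1, my1), (a1, b1, d1) = g1
    (mx2, my2), (a2, b2, d2) = g2
    center = (mx1 - mx2) ** 2 + (my1 - my2) ** 2
    det1 = a1 * d1 - b1 * b1
    det2 = a2 * d2 - b2 * b2
    tr12 = a1 * a2 + 2 * b1 * b2 + d1 * d2  # trace(S1 S2) for symmetric S1, S2
    # max(..., 0.0) guards against tiny negative values from rounding
    cross = math.sqrt(max(tr12 + 2 * math.sqrt(max(det1 * det2, 0.0)), 0.0))
    return center + (a1 + d1) + (a2 + d2) - 2 * cross

# Example: two boxes that differ only by a translation of (3, 4)
g_a = box_to_gaussian(0.0, 0.0, 4.0, 2.0, 0.0)
g_b = box_to_gaussian(3.0, 4.0, 4.0, 2.0, 0.0)
print(wasserstein_sq(g_a, g_b))
```

One consequence of this representation is that a box with size $(w, h)$ at angle $\theta$ and a box with size $(h, w)$ at angle $\theta + \pi/2$ map to the same Gaussian, so the distance between them is zero.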

What problem does this paper attempt to address?

This paper addresses oriented object detection in remote sensing images, where objects appear in many orientations and are therefore hard to classify and localize accurately. Specifically, directly extending the Transformer framework to oriented object detection faces three main problems:

1. **Angle Encoding**: Objects can rotate arbitrarily, so angle, position, and size must be encoded together. Existing Transformer methods encode only position and size and ignore the angle.
2. **Lack of Geometric Relationships**: The self-attention mechanism does not model the geometric relationships of oriented objects. Content queries and positional queries do not interact, so these relationships cannot be captured.
3. **Misalignment Problem**: Because objects can rotate arbitrarily and multi-scale image features have a pyramid structure, values and positional queries in cross-attention are often misaligned, which makes it difficult to classify and localize target objects accurately.

To solve these problems, the paper proposes OrientedFormer, an end-to-end Transformer-based oriented object detection framework, and introduces three specialized modules:

1. **Gaussian Positional Encoding (PE)**: Converts the oriented box into a Gaussian distribution, which encodes angle, position, and size in a unified way and solves the angle encoding problem.
2. **Wasserstein Self-Attention**: Uses Gaussian Wasserstein distance scores to measure the geometric relationships between different content queries, so that content queries and positional queries interact. This addresses the lack of geometric relationships.
3. **Oriented Cross-Attention**: Aligns values and positional queries by rotating the sampling points around the positional query according to its angle, which solves the misalignment problem.

According to the reported experiments, OrientedFormer gains 1.16 and 1.21 AP$_{50}$ on DIOR-R and DOTA-v1.0 over previous end-to-end detectors, while reducing training epochs from 3$\times$ to 1$\times$.
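The rotation step in oriented cross-attention can be pictured with a small geometric sketch. This is an assumption-laden illustration, not the paper's implementation: the helper name `rotate_sampling_points` is hypothetical, the positional query is assumed to be a box `(cx, cy, w, h, theta)` with `theta` in radians, and sampling offsets are assumed to be expressed as fractions of the box size in the box's own frame. How offsets are predicted and how features are sampled at the resulting points is outside what this text describes.

```python
import math

def rotate_sampling_points(box, offsets):
    """Place sampling points around an oriented box.

    box:     (cx, cy, w, h, theta), theta in radians (assumed layout)
    offsets: iterable of (dx, dy) in box-relative units, e.g. in [-0.5, 0.5]
    Returns a list of (x, y) image coordinates after scaling each offset by
    the box size and rotating it by theta around the box center.
    """
    cx, cy, w, h, theta = box
    c, s = math.cos(theta), math.sin(theta)
    points = []
    for dx, dy in offsets:
        lx, ly = dx * w, dy * h  # offset in the box's local frame
        # apply the 2D rotation matrix [[c, -s], [s, c]] and shift to the center
        points.append((cx + c * lx - s * ly, cy + s * lx + c * ly))
    return points

# Example: the same relative offset lands on different image locations
# once the box is rotated by 90 degrees.
offsets = [(0.5, 0.0), (0.0, 0.5)]
print(rotate_sampling_points((10.0, 20.0, 4.0, 2.0, 0.0), offsets))
print(rotate_sampling_points((10.0, 20.0, 4.0, 2.0, math.pi / 2), offsets))
```

With an angle of zero the points sit on an axis-aligned grid, which is the situation an unrotated sampling scheme implicitly assumes. Rotating by the box angle keeps the points on the object as it turns.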

OrientedFormer: An End-to-End Transformer-Based Oriented Object Detector in Remote Sensing Images

Oriented Object Detection with Transformer

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

Efficient Inductive Vision Transformer for Oriented Object Detection in Remote Sensing Imagery

Orientation-First Strategy With Angle Attention Module for Rotated Object Detection in Remote Sensing Images

Learning RoI Transformer for Oriented Object Detection in Aerial Images

Transformer-Based Multi-layer Feature Aggregation and Rotated Anchor Matching for Oriented Object Detection in Remote Sensing Images

On Improving Bounding Box Representations for Oriented Object Detection

An Improved DETR Based on Angle Denoising and Oriented Boxes Refinement for Remote Sensing Object Detection

DETR-ORD: An Improved DETR Detector for Oriented Remote Sensing Object Detection with Feature Reconstruction and Dynamic Query

RiDOP: A Rotation-Invariant Detector with Simple Oriented Proposals in Remote Sensing Images

Learning RoI Transformer for Detecting Oriented Objects in Aerial Images

Oriented Object Detector with Gaussian Distribution Cost Label Assignment and Task-Decoupled Head

Spatial Transform Decoupling for Oriented Object Detection

Hierarchical Mask Prompting and Robust Integrated Regression for Oriented Object Detection

A Refined Single-Stage Detector With Feature Enhancement and Alignment for Oriented Objects

ADT-Det: Adaptive Dynamic Refined Single-Stage Transformer Detector for Arbitrary-Oriented Object Detection in Satellite Optical Imagery

Dual-Aligned Oriented Detector

Task-Aligned Oriented Object Detection in Remote Sensing Images

Oriented objects as pairs of middle lines

Feature Enhancement Based Oriented Object Detection in Remote Sensing Images