Abstract: In this paper, an improved detector, MRTMDet, is proposed to overcome the challenges that complex background noise and large scale variations pose for oriented object detection in remote sensing images, by designing a new feature extraction network and a new feature fusion network. These networks integrate a lightweight vision transformer and a multi‐scale feature extraction module in different structures, improving the quality of the feature representation and strengthening the model's ability to perceive both global and multi‐scale features. Ablation and comparison experiments on the publicly available DIOR‐R dataset show that the model achieves strong overall performance and balances precision with a lightweight design.

Object detection in remote sensing images aims to interpret images to obtain the category and location of potential targets, which is of great importance in traffic detection, marine supervision, and space reconnaissance. However, the complex backgrounds and large scale variations in remote sensing images present significant challenges. Traditional methods relied mainly on image filtering or feature descriptors to extract features, which limited their performance. Deep learning methods, especially one‐stage detectors such as the Real‐Time Object Detector (RTMDet), offer advanced solutions with efficient network architectures. Nevertheless, the difficulty of extracting features from complex backgrounds and of localising targets in images with large scale variations limits detection accuracy. In this paper, an improved detector based on RTMDet, called the Multi‐Scale Feature Extraction‐assist RTMDet (MRTMDet), is proposed, which addresses these limitations through enhanced feature extraction and fusion networks.
At the core of MRTMDet are a new backbone network, MobileViT++, and a feature fusion network, SFC‐FPN. Together they enhance the model's ability to capture global and multi‐scale features through a carefully designed hybrid feature processing unit that combines a CNN with a transformer, based on the vision transformer (ViT) and poly‐scale convolution (PSConv), respectively. Experiments on DIOR‐R demonstrate that MRTMDet achieves a competitive 62.2% mAP, balancing precision with a lightweight design.
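
The abstract does not describe how the multi‐scale module is built, so the following is only a minimal sketch of the general idea behind multi‐scale convolution: applying one small kernel at several dilation rates so that each branch sees a different receptive field. It is a 1D toy in plain Python, not the authors' PSConv, MobileViT++ or SFC‐FPN implementation. The function names (`dilated_conv1d`, `receptive_field`, `multi_scale_features`) and the dilation rates (1, 2, 4) are assumptions chosen for illustration.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1D correlation with `dilation` steps between kernel taps."""
    span = (len(kernel) - 1) * dilation  # distance from first to last tap
    return [
        sum(w * signal[i + k * dilation] for k, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, dilation):
    """Number of input positions spanned by a single output value."""
    return (kernel_size - 1) * dilation + 1


def multi_scale_features(signal, kernel, dilations=(1, 2, 4)):
    """Apply the same kernel at several dilation rates (illustrative rates only)."""
    return {d: dilated_conv1d(signal, kernel, d) for d in dilations}


if __name__ == "__main__":
    # A toy "object" of width 6 on an empty background.
    signal = [0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
    edge_kernel = [-1, 0, 1]  # simple difference filter
    for d, feat in multi_scale_features(signal, edge_kernel).items():
        print("dilation", d, "receptive field", receptive_field(len(edge_kernel), d), feat)
```

With a 3‐tap kernel, the branches cover 3, 5 and 9 input positions while using the same number of weights, which is why dilation is a common low‐cost way to respond to targets of different sizes.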
