Abstract: In object detection for high-resolution remote sensing images (HRRSIs), existing YOLOv5 methods face several challenges, including dense object arrangements, small object sizes, and complex backgrounds. To tackle these challenges, we propose a novel approach called C3TB-YOLOv5, which combines standard YOLOv5 with the Transformer model to detect objects in HRRSIs. Unlike conventional YOLOv5 methods, which primarily capture local information from remote sensing scenes, C3TB-YOLOv5 incorporates global information through a new C3TB module. This module, based on the Transformer multi-head attention mechanism (AM), consists of two branches that extract local and global information from feature maps. By integrating these branches and establishing long-range relationships, our method detects densely arranged small objects in HRRSIs. Furthermore, to improve the accuracy of tiny object detection, we develop a new detection head that makes use of the previously unused C3 module, thereby preventing the loss of fine-grained textures and positional features. In addition, we integrate an enhanced SimAM, namely Sim-GMP, into the model to adjust the focus across different regions, effectively distinguishing the features of objects of interest from complex backgrounds. Finally, to address the problem of sample imbalance in remote sensing object detection, the recent Wise-IoU v3 loss function is employed to improve the accuracy of anchor box predictions. To maintain a high detection speed, the most critical C3 modules are replaced with the proposed C3TB module, striking a balance between detection accuracy and a lightweight model. Extensive experiments on two remote sensing datasets, NWPU VHR-10 and VisDrone 2019, demonstrate that our method achieves better object detection performance than state-of-the-art methods.
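To make the attention step more concrete, the following is a minimal sketch of the parameter-free weighting used by the original SimAM, on which Sim-GMP is said to build. It is an illustration only: the abstract does not specify how Sim-GMP modifies SimAM, so that enhancement is not reproduced here. The function names `simam_weights` and `apply_simam`, the single-channel list-of-rows input, and the default `lam=1e-4` are assumptions chosen for the example.

```python
import math


def simam_weights(channel, lam=1e-4):
    """Return SimAM-style attention weights for one feature-map channel.

    `channel` is a list of rows (H x W floats). This follows the original
    SimAM formulation, not the paper's Sim-GMP variant (an assumption).
    """
    values = [v for row in channel for v in row]
    n = len(values) - 1  # original SimAM normalises by H*W - 1
    if n < 1:
        raise ValueError("channel needs at least two elements")

    mean = sum(values) / len(values)
    # Squared deviation of every position from the channel mean.
    sq_dev = [[(v - mean) ** 2 for v in row] for row in channel]
    var = sum(d for row in sq_dev for d in row) / n

    weights = []
    for dev_row in sq_dev:
        w_row = []
        for d in dev_row:
            # Inverse of the minimal energy: positions that stand out from
            # the rest of the channel get a larger value.
            inv_energy = d / (4.0 * (var + lam)) + 0.5
            w_row.append(1.0 / (1.0 + math.exp(-inv_energy)))  # sigmoid
        weights.append(w_row)
    return weights


def apply_simam(channel, lam=1e-4):
    """Rescale each position of the channel by its attention weight."""
    weights = simam_weights(channel, lam)
    return [[v * w for v, w in zip(row, w_row)]
            for row, w_row in zip(channel, weights)]


if __name__ == "__main__":
    # A single bright response on a flat background.
    spike = [[0.0, 0.0, 0.0],
             [0.0, 9.0, 0.0],
             [0.0, 0.0, 0.0]]
    for row in simam_weights(spike):
        print([round(w, 3) for w in row])
```

In this toy channel the isolated response receives a larger weight than the flat background, which is the behaviour the abstract relies on when separating objects of interest from complex backgrounds. In a real detector the same computation would run per channel over a batched tensor.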

CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

Cascade-DETR: Delving into High-Quality Universal Object Detection

Exploring Test-Time Adaptation for Object Detection in Continually Changing Environments

CAT: A Simple yet Effective Cross-Attention Transformer for One-Shot Object Detection

A Convolution with Transformer Attention Module Integrating Local and Global Features for Object Detection in Remote Sensing Based on YOLOv8n

Category-Aware Transformer Network for Better Human-Object Interaction Detection

End-to-End Object Detection with Adaptive Clustering Transformer

Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection

CAT: a Coarse-to-fine Attention Tree for Semantic Change Detection

Text-Guided Unknown Pseudo-Labeling for Open-World Object Detection

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

Simple Image-level Classification Improves Open-vocabulary Object Detection

Annealing-Based Label-Transfer Learning for Open World Object Detection

Transformer with large convolution kernel decoder network for salient object detection in optical remote sensing images

OcTr: Octree-based Transformer for 3D Object Detection

Unsupervised Recognition of Unknown Objects for Open-World Object Detection

Category-Extensible Out-of-Distribution Detection via Hierarchical Context Descriptions

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

C3TB-YOLOv5: integrated YOLOv5 with transformer for object detection in high-resolution remote sensing images

CoDA: Collaborative Novel Box Discovery and Cross-modal Alignment for Open-vocabulary 3D Object Detection