Abstract: Transformer models have recently been introduced into remote sensing image object detection because of their ability to model long-range dependencies. However, existing transformer-based object detection methods mainly consider the global interaction of local elements and have a limited ability to enhance local information, which makes it difficult to distinguish real objects from a complex background. In this letter, a query-enhanced transformer (QETR) model is proposed to address these problems. The proposed model consists of three main parts: an encoder, a decoder, and a detection head. In the encoder, a Swin transformer is used to extract deep features. In the decoder, the object and anchor queries are initialized, and the feature and position information of the objects is learned by the multihead self-attention (MHSA) and cross-attention mechanisms, respectively. Furthermore, a query align (QA) module and a scale controller are proposed to enhance the object information around the local queries by limiting the attention to a certain range without losing important information. Finally, the boundaries and types of the objects are obtained from the detection head based on bipartite matching. To verify the effectiveness of the proposed method, comparative experiments were carried out against other state-of-the-art methods on two public datasets: the High-Resolution Remote Sensing Detection (HRRSD) dataset and the object detection in optical remote sensing images (DIOR) dataset. The experimental results confirm the effectiveness and superiority of the QETR model, which achieved mean average precision (mAP) values of 71.5% and 91.1% on the DIOR and HRRSD datasets, respectively.
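
The bipartite matching step mentioned in the abstract can be illustrated with a minimal sketch. This is not the QETR implementation: the function name `match_predictions`, the cost values, and the brute-force search are assumptions made purely for illustration. The abstract does not say how the matching cost is defined or solved; detectors of this kind typically use the Hungarian algorithm rather than exhaustive search.

```python
from itertools import permutations


def match_predictions(cost):
    """Return (assignment, total_cost) minimizing the sum of cost[q][g].

    cost[q][g] is the (hypothetical) cost of assigning query q to
    ground-truth object g. Assumes at least as many queries as objects.
    """
    num_queries = len(cost)
    num_targets = len(cost[0]) if cost else 0
    best_assignment, best_total = None, float("inf")
    # Try every ordered choice of distinct queries, one per ground-truth object.
    for queries in permutations(range(num_queries), num_targets):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_assignment, best_total = queries, total
    # assignment[g] is the index of the query matched to ground-truth object g.
    return list(best_assignment), best_total


# Illustrative costs only: 3 queries (rows) versus 2 ground-truth boxes (columns).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
assignment, total = match_predictions(cost)
```

Here each ground-truth object receives exactly one query, and any unmatched query would be treated as background. The exhaustive search is only practical for a handful of queries; it is used here to keep the one-to-one constraint explicit.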

Cross-Lingual Query-by-Example Spoken Term Detection: A Transformer-Based Approach

Neural Network based End-to-End Query by Example Spoken Term Detection

Query-by-Example Spoken Term Detection using Attentive Pooling Networks

Query-by-example Spoken Term Detection Based on Phonetic Posteriorgram

Multilingual Bottleneck Features for Query by Example Spoken Term Detection

Query-by-example Spoken Term Detection using Attention-based Multi-hop Networks

Siamese Recurrent Auto-Encoder Representation for Query-by-Example Spoken Term Detection

Learning Frame-Level Recurrent Neural Networks Representations for Query-by-Example Spoken Term Detection on Mobile Devices

Transformer-based encoder-encoder architecture for Spoken Term Detection

Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection

A Nonparametric Bayesian Approach for Spoken Term detection by Example Query

QETR: A Query-Enhanced Transformer for Remote Sensing Image Object Detection

Cross-lingual and Multilingual Spoken Term Detection for Low-Resource Indian Languages

Multilingual Query-by-Example Keyword Spotting with Metric Learning and Phoneme-to-Embedding Mapping

Semantic query-by-example speech search using visual grounding

Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning

Language Query-Based Transformer With Multiscale Cross-Modal Alignment for Visual Grounding on Remote Sensing Images

Weighted fast sequential DTW for multilingual audio Query-by-Example retrieval

BEST-STD: Bidirectional Mamba-Enhanced Speech Tokenization for Spoken Term Detection

Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder

Unsupervised Discovery of Structured Acoustic Tokens with Applications to Spoken Term Detection