HFSI-TF: Hierarchical Full-Scale Interactive Transformer Model for Object Detection in Remote Sensing Image

Daxiang Li, Bingying Li, Ying Liu
DOI: https://doi.org/10.1109/lgrs.2024.3482693
IF: 5.343
2024-11-01
IEEE Geoscience and Remote Sensing Letters
Abstract: Transformer-based object detection models usually adopt an encoder-decoder architecture built mainly from self-attention (SA) and multilayer perceptron (MLP) layers. Although this architecture requires no nonmaximum suppression (NMS) and thus achieves truly end-to-end object detection, it perceives multiscale objects in the image insufficiently, which leads to low accuracy on small objects. To address this issue, a new full-scale bidirectional interactive attention (FSBDIA) mechanism is constructed, and on this basis a novel hierarchical full-scale interactive transformer (HFSI-TF) model is designed for object detection in remote sensing images (RSIs). First, to enhance the multiscale perception ability of the model, the FSBDIA mechanism is designed under the guidance of full-scale information. Then, based on FSBDIA, a hierarchical HFSI-TF encoder is constructed to interactively fuse multilayer feature maps layer by layer, thereby obtaining multiscale encoded features of the RSI. Finally, a mixed cross attention (MCA) mechanism is constructed, and an iterative decoding architecture is designed on top of it to improve the accuracy of small object detection. Comparative experiments on two benchmark datasets (DIOR and HRSC2016) show that the HFSI-TF model effectively improves object detection accuracy in RSIs and outperforms other state-of-the-art methods.
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics
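
The abstract contrasts self-attention within a single feature map with attention that lets feature maps of different scales exchange information in both directions. The snippet below is only a minimal sketch of that general idea using plain scaled dot-product attention; it is not the paper's FSBDIA or MCA mechanism, whose details the abstract does not give. The function names, the toy token values, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Illustrative only: no learned projections, single head.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


# Hypothetical toy tokens: a fine-scale map (4 tokens) and a
# coarse-scale map (2 tokens), each token a 3-dimensional vector.
fine = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
coarse = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]

# Self-attention: tokens attend only to their own scale.
self_attended = attention(fine, fine, fine)

# Cross-scale attention in both directions (assumed illustration of
# "bidirectional" interaction, not the published design).
fine_from_coarse = attention(fine, coarse, coarse)
coarse_from_fine = attention(coarse, fine, fine)
```

Each output keeps the token count of its query side, so the fine map stays at four tokens and the coarse map at two while both absorb information from the other scale. A real detector would add learned projections, multiple heads, and positional information on top of this.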