Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images

Ke Li, Di Wang, Haojie Xu, Haodi Zhong, Cong Wang
DOI: https://doi.org/10.1109/tgrs.2024.3423663
IF: 8.2
2024-07-16
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Visual grounding in remote sensing images (RSVG) aims to localize the specific objects that referring expressions describe. Existing methods typically combine the outputs of pretrained visual and linguistic backbones to locate referred objects. However, because the visual backbone does not interact with the language modality during feature extraction, it may suffer from attention drift, which limits RSVG performance. To avoid this, we propose a novel RSVG framework, language-guided progressive visual attention (LPVA), which achieves precise attention on referred objects by adjusting visual features with a progressive attention (PA) module and a multilevel feature enhancement (MFE) decoder. Specifically, the PA module dynamically generates multiscale weights and biases, enabling the visual backbone to gradually focus on expression-related features at the spatial and channel levels. The MFE decoder aggregates visual contextual information about the referred objects to make their features more distinctive while suppressing information from irrelevant regions. To thoroughly examine the localization capability of RSVG models, we construct a new large-scale benchmark dataset, OPT-RSVG, which requires comprehensive understanding of complex scenarios. Experimental results show that the proposed method raises the accuracy score to 82.27% on the DIOR-RSVG dataset (an absolute improvement of 6.29 percentage points) and to 78.03% on the OPT-RSVG dataset, setting new records on both. The source code of the proposed method and the OPT-RSVG dataset are available at https://github.com/like413/OPT-RSVG.
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics
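
The abstract says the PA module generates weights and biases that adjust visual features at the channel level. As a rough illustration of that general idea only, the sketch below applies a FiLM-style scale-and-shift to a feature map, with the scale and shift predicted from a text embedding. This is not the authors' implementation: the function names `film_params` and `modulate`, the projection matrices `w_gamma` and `w_beta`, the `tanh` squashing, and the `1 + gamma` residual form are all assumptions made for the example, and the spatial-level and multiscale parts of the module are not modeled.

```python
import math

def film_params(text, w_gamma, w_beta):
    """Project a text embedding (length D) to per-channel weights and biases.

    w_gamma and w_beta are hypothetical D x C projection matrices given as
    nested lists; in a real model they would be learned parameters.
    """
    channels = len(w_gamma[0])
    gamma = [math.tanh(sum(t * row[c] for t, row in zip(text, w_gamma)))
             for c in range(channels)]
    beta = [math.tanh(sum(t * row[c] for t, row in zip(text, w_beta)))
            for c in range(channels)]
    return gamma, beta

def modulate(visual, gamma, beta):
    """Scale and shift each channel of a C x H x W feature map (nested lists).

    Uses (1 + gamma) so that gamma = 0 and beta = 0 leave the features unchanged.
    """
    return [
        [[(1.0 + g) * v + b for v in row] for row in channel]
        for channel, g, b in zip(visual, gamma, beta)
    ]

# Toy example: D = 2 text dimensions, C = 2 channels, 1 x 2 spatial grid.
text = [0.5, -1.0]
w_gamma = [[0.0, 1.0], [0.0, 0.0]]  # only channel 1 responds to the text
w_beta = [[0.0, 0.0], [0.0, 0.0]]   # no shift in this toy setup
visual = [[[1.0, 2.0]], [[3.0, 4.0]]]

gamma, beta = film_params(text, w_gamma, w_beta)
out = modulate(visual, gamma, beta)
```

In this toy setup channel 0 passes through unchanged while channel 1 is amplified by a factor of `1 + tanh(0.5)`, which shows how a language signal can emphasize some channels over others. Applying the same step with separate projections at several backbone stages would be one way to approximate the "multiscale" aspect, again as an assumption rather than a description of the paper's design.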