A Spatial Frequency Fusion Strategy Based on Linguistic Query Refinement for RSVG

Enyuan Zhao†, Ziyi Wan†, Ze Zhang, Jie Nie, Xinyue Liang, Lei Huang
DOI: https://doi.org/10.1109/tgrs.2024.3471082
IF: 8.2
2024-01-01
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Remote Sensing Visual Grounding (RSVG) aims to pinpoint the regions of a remote sensing image that are most relevant to a textual description. Existing RSVG methods focus on multimodal semantic integration on specific datasets and overlook the intricate high-dimensional information in remote sensing images, where color, scale, and semantics are tightly coupled. As a result, they are limited in handling specific receptive fields and in preserving structural information, which leads to insufficient localization precision. To address this issue, this paper introduces a strategy that incorporates spatial frequency information, using Fourier transforms to capture global structured information from remote sensing data, followed by progressive aggregation across spatial, spectral, and linguistic modalities to achieve robust semantic coreference. The primary contributions are as follows. First, a spatial-frequency fusion strategy based on linguistic query refinement is proposed; it significantly improves visual grounding performance by expanding the spectral receptive field to extract spectral features with strong spatial perception. Second, a frequency-guided spatial module is designed, which uses amplitude and phase-structured spectral features to further strengthen spatial representation. Third, a query-aware original attention mechanism is developed to deeply integrate spatial and spectral information under linguistic guidance. Extensive experiments on the RSVGD dataset validate the proposed approach, which outperforms state-of-the-art methods.
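The abstract's reliance on Fourier amplitude and phase can be illustrated with a toy sketch. This is not the paper's frequency-guided spatial module; it only shows, on a hypothetical 4x4 patch, how a 2D Fourier transform splits an image into amplitude and phase, and why each frequency coefficient is "global" (every coefficient sums over all pixels). All function names and the patch values are assumptions made for illustration, and the naive loops stand in for a library FFT.

```python
import cmath
import math


def dft2(image):
    """Naive 2D discrete Fourier transform of a list-of-lists grid."""
    rows, cols = len(image), len(image[0])
    spectrum = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            # Every coefficient sums over ALL pixels: a global receptive field.
            for x in range(rows):
                for y in range(cols):
                    angle = -2 * math.pi * (u * x / rows + v * y / cols)
                    total += image[x][y] * cmath.exp(1j * angle)
            spectrum[u][v] = total
    return spectrum


def idft2(spectrum):
    """Naive inverse 2D DFT; returns the real part of each reconstructed pixel."""
    rows, cols = len(spectrum), len(spectrum[0])
    image = [[0.0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            total = 0j
            for u in range(rows):
                for v in range(cols):
                    angle = 2 * math.pi * (u * x / rows + v * y / cols)
                    total += spectrum[u][v] * cmath.exp(1j * angle)
            image[x][y] = (total / (rows * cols)).real
    return image


def amplitude_phase(spectrum):
    """Split a complex spectrum into its amplitude and phase components."""
    amplitude = [[abs(c) for c in row] for row in spectrum]
    phase = [[cmath.phase(c) for c in row] for row in spectrum]
    return amplitude, phase


def recombine(amplitude, phase):
    """Rebuild the complex spectrum from amplitude and phase."""
    return [
        [cmath.rect(a, p) for a, p in zip(amp_row, phase_row)]
        for amp_row, phase_row in zip(amplitude, phase)
    ]


# Hypothetical patch with an L-shaped structure (illustrative values only).
patch = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]

spectrum = dft2(patch)
amplitude, phase = amplitude_phase(spectrum)
restored = idft2(recombine(amplitude, phase))
```

Amplitude and phase together carry the full image, so recombining them and inverting the transform recovers the patch up to floating-point error. A real pipeline would apply a fast FFT to learned feature maps rather than loop over raw pixels.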