Multi-Stage Synergistic Aggregation Network for Remote Sensing Visual Grounding
Fuyan Wang, Chunlei Wu, Jie Wu, Leiquan Wang, Canwei Li
DOI: https://doi.org/10.1109/lgrs.2024.3360473
IF: 5.343
2024-01-01
IEEE Geoscience and Remote Sensing Letters
Abstract: Visual grounding has broad application prospects in remote sensing. Current state-of-the-art methods are predominantly based on the transformer architecture and use multi-head self-attention in multi-modal encoders to integrate visual and textual features. However, they typically rely on a single fusion approach, which may limit the model’s capacity to learn intricate correlations between textual semantics and visual information. Moreover, they do not establish a direct dependency between features and bounding box representations, which restricts the fused features to the conventional object detection paradigm and constrains the interaction between regression results and encoded features. To address these limitations, we adopt a generative paradigm that directly generates a discrete coordinate sequence in an auto-regressive manner, thereby exploring the interaction between direct regression features and encoded multi-modal features. In addition, we propose a novel multi-stage synergistic aggregation module that acquires multi-modal features at multiple scales by aggregating visual and textual contexts, improving overall performance. We validate our framework on the DIOR-RSVG dataset and compare it with existing methods, achieving a noteworthy improvement in accuracy. The proposed approach offers a promising direction for advancing visual grounding in remote sensing applications. The related code and weights are available at https://github.com/waynamigo/MSAM.
imaging science & photographic technology,remote sensing,engineering, electrical & electronic,geochemistry & geophysics
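The abstract describes generating a discrete coordinate sequence in an auto-regressive manner instead of regressing a box directly. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes each box coordinate is quantized into one of `num_bins` integer tokens and that tokens are decoded one at a time, each conditioned on the previous ones. The function names, the bin count, the 800 x 800 image size, and the stub scoring function are all hypothetical and are not taken from the paper or the MSAM repository.

```python
def quantize_box(box, image_size, num_bins=1000):
    """Map a pixel box (x_min, y_min, x_max, y_max) to integer bin tokens."""
    width, height = image_size
    scales = (width, height, width, height)
    tokens = []
    for value, scale in zip(box, scales):
        ratio = min(max(value / scale, 0.0), 1.0)  # clamp to [0, 1]
        tokens.append(min(int(ratio * num_bins), num_bins - 1))
    return tokens


def dequantize_box(tokens, image_size, num_bins=1000):
    """Map bin tokens back to pixel coordinates (bin centers)."""
    width, height = image_size
    scales = (width, height, width, height)
    return [(t + 0.5) / num_bins * s for t, s in zip(tokens, scales)]


def greedy_decode(next_token_scores, num_coords=4):
    """Auto-regressive greedy decoding of a coordinate token sequence.

    next_token_scores is a hypothetical callable standing in for a trained
    decoder: given the tokens produced so far, it returns one score per bin.
    """
    tokens = []
    for _ in range(num_coords):
        scores = next_token_scores(tokens)
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)
    return tokens


if __name__ == "__main__":
    image_size = (800, 800)  # assumed size, for illustration only
    target = quantize_box((100, 200, 400, 600), image_size)

    # Stub "model" that always prefers the next ground-truth token.
    def stub_scores(prefix, num_bins=1000):
        scores = [0.0] * num_bins
        scores[target[len(prefix)]] = 1.0
        return scores

    predicted = greedy_decode(stub_scores)
    print(predicted, dequantize_box(predicted, image_size))
```

Under this assumed scheme, localization becomes a sequence prediction problem, so each coordinate token can attend to the fused visual and textual features as well as to the coordinates already generated. The quantization error is bounded by one bin width, which is the image side length divided by `num_bins`.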