Attribute-Prompting Multi-Modal Object Reasoning Transformer for Remote Sensing Visual Grounding

Heqian Qiu, Lanxiao Wang, Minjian Zhang, Taijin Zhao, Hongliang Li
DOI: https://doi.org/10.1109/igarss53475.2024.10642669
2024-01-01
Abstract: The remote sensing visual grounding (RSVG) task aims to locate the particular object in a remote sensing image that is referred to by a natural language expression, which requires precisely fusing and aligning features from different modalities. However, existing methods usually rely on object-based multi-modal fusion, which is limited in its ability to capture detailed object characteristics in remote sensing images and therefore tends to confuse the target with similar objects. To address this problem, we propose an attribute-prompting multi-modal object reasoning network for RSVG. Specifically, we first develop a learnable attribute prompter that adaptively explores diverse and rich attribute information according to common object characteristics in remote sensing (RS) imagery. With the help of these attribute prompts, we design an attribute-prompting multi-modal fusion encoder that builds fine-grained interaction and alignment between the visual and language features to avoid object confusion. Furthermore, we design a multi-modal progressive object reasoning decoder that gradually queries more comprehensive object features for accurate object localization. Experimental results demonstrate that the proposed method achieves significant improvements.