Scale-Aware Adaptive Refinement and Cross-Interaction for Remote Sensing Audio-Visual Cross-Modal Retrieval

Yaxiong Chen, Chuang Du, Yunfei Zi, Shengwu Xiong, Xiaoqiang Lu
DOI: https://doi.org/10.1109/tgrs.2024.3443085
IF: 8.2
2024-08-25
IEEE Transactions on Geoscience and Remote Sensing
Abstract: Remote sensing (RS) audio-visual cross-modal retrieval is a challenging task in the search for meaningful RS information. The impact of multiscale features and the associated redundant information in RS images cannot be overlooked in this retrieval task. In addition, handling the completely different physical expressions of different modalities is crucial for cross-modal retrieval. To tackle these issues, we propose a Scale-aware Adaptive Refinement and Cross-Interaction (SARCI) network. The Quaternion-attention Dominated Multiscale Visual Refinement (QDMVR) module in SARCI is proposed to learn multiscale visual features and to further refine the features that contain redundant information at each scale. To better integrate channel attention and spatial attention for adaptive learning of meaningful visual semantics, we propose symmetric quaternion attention (SQA) within the QDMVR module to enhance RS visual features. The SQA mechanism acts on both high-level and low-level features to explore salient RS visual information across different scales. To allow information from different modalities to interact more effectively, we propose the Instruction-based Cross-Learning Module (ICLM), which performs cross-modal feature interaction based on the characteristics of the two modalities. The SARCI network demonstrates state-of-the-art performance on three public RS cross-modal datasets: the Sydney, UCM, and RSICD audio-visual datasets. The code is available at: https://github.com/WUTCM-Lab/SARCI.
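As a rough illustration of the retrieval setting the abstract describes (and not of the SARCI network itself), the sketch below ranks a small gallery of image embeddings against one audio query embedding. Everything in it is an assumption made for illustration: the function names `cosine_similarity` and `rank_gallery`, the toy three-dimensional embeddings, and the use of cosine similarity as the matching score, which the abstract does not specify.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, item) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy, made-up embeddings: one audio query and three image embeddings.
# In a real system these would come from trained audio and visual encoders.
audio_query = [1.0, 0.0, 1.0]
image_gallery = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_gallery(audio_query, image_gallery)
print(ranking)
```

In this toy setup the image whose embedding points in the same direction as the audio query is ranked first. Retrieval in the opposite direction (image query, audio gallery) would use the same ranking step with the roles swapped.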
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geochemistry & geophysics
What problem does this paper attempt to address?