Multimodal Fusion Remote Sensing Image–Audio Retrieval
Rui Yang, Shuang Wang, Yingzhi Sun, Huan Zhang, Yu Liao, Yu Gu, Biao Hou, Licheng Jiao
DOI: https://doi.org/10.1109/jstars.2022.3194076
IF: 4.715
2022-08-12
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
Abstract: Remote sensing image–audio retrieval (RSIAR) has emerged as a research topic in recent years, and many methods have been proposed for it. These RSIAR methods achieve good retrieval results, but two problems remain: the audio modality lacks discriminability, and a heterogeneous gap exists between audio and images. Together, these problems leave the cross-modal common embedding space for audio and images suboptimal, which limits retrieval performance. This article proposes a novel RSIAR method, multimodal fusion remote sensing image–audio retrieval (MMFR), to address both problems. MMFR first converts the original audio input to text. It then uses a feature fusion module to obtain a representation that fuses text information with the audio feature, replacing the audio-only representation. Fusing text information makes the pronunciation-based audio feature more semantically discriminable and turns it into a more "high-level" fused feature that helps bridge the heterogeneous gap. Seven different fusion methods are evaluated in the feature fusion module. In addition, the triplet loss, the semantic loss, and the consistency loss are used to optimize the common retrieval space. Extensive experiments on the UCM_IV, RSICD_IV, and SYDNE_IV datasets demonstrate that MMFR outperforms state-of-the-art methods.
imaging science & photographic technology; remote sensing; engineering, electrical & electronic; geography, physical
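
The fusion-then-triplet-loss idea described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the function names (fuse_concat, triplet_loss), the choice of concatenation as the fusion method (the abstract mentions seven fusion methods without naming them), the cosine distance, the margin of 0.2, and the toy vectors. The sketch also assumes image embeddings already live in the same space as the fused feature, whereas a real model would learn projections into the common space.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returns a copy if the norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse_concat(audio_feat, text_feat):
    """Hypothetical fusion: concatenate audio and text features, then normalize.

    Concatenation is only one simple fusion choice, assumed here for illustration.
    """
    return l2_normalize(list(audio_feat) + list(text_feat))


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on cosine distance: max(0, d(a, p) - d(a, n) + margin).

    The margin value is an assumed default, not a value from the paper.
    """
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy example: a fused audio+text query, a matching image, and a non-matching image.
query = fuse_concat([1.0, 0.0], [0.0, 1.0])
matching_image = [1.0, 0.0, 0.0, 1.0]
other_image = [0.0, 1.0, 1.0, 0.0]

print("loss (correct ordering):", triplet_loss(query, matching_image, other_image))
print("loss (swapped ordering):", triplet_loss(query, other_image, matching_image))
```

In this toy setup the loss is zero when the matching image is already closer to the fused query than the non-matching one by at least the margin, and positive when the ordering is violated. The semantic and consistency losses mentioned in the abstract are not sketched here because the abstract does not define them.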