A Fusion-Based Contrastive Learning Model for Cross-Modal Remote Sensing Retrieval

Haoran Li, Wei Xiong, Yaqi Cui, Zhenyu Xiong
DOI: https://doi.org/10.1080/01431161.2022.2091964
IF: 3.531
2022-01-01
International Journal of Remote Sensing
Abstract: With the rapid growth of cross-modal data, cross-modal retrieval has become a research hotspot in remote sensing, and remote-sensing image-text retrieval (RSITR) has attracted extensive attention because it offers a flexible and efficient way to obtain information of interest and has practical applications. However, most existing methods cannot adequately extract fine-grained unimodal features and are poor at exploring potential correlations between modalities, leading to unsatisfactory performance. In addition, most existing datasets and methods for image-text retrieval are based on English, and little research has focused on Chinese captions, although the application of image-text retrieval in remote sensing should not be restricted by language. In this article, we introduce a novel fusion-based contrastive learning model (FBCLM) for RSITR to address the problems of unimodal feature extraction and correlation exploration for remote-sensing image-text pairs; the model supports image-text retrieval on both English and Chinese caption datasets. Our model employs unimodal encoders containing a self-attention module to extract fine-grained features of each modality, and further uses a cross-modal fusion module based on the cross-attention mechanism to improve the discriminative ability of the feature representation. Furthermore, a contrastive loss is applied to enhance image-text retrieval performance by exploring the underlying semantic relationship between visual and textual representations. We also construct several remote-sensing image Chinese caption datasets for RSITR. Experimental results on several public RSITR datasets and the proposed datasets demonstrate the superior performance of our model in the cross-modal remote-sensing image-text retrieval task.
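The abstract does not give the exact form of the contrastive loss, so the following is only a minimal sketch of one common choice: a symmetric, InfoNCE-style loss over a batch of paired image and text embeddings. The function names, the temperature value of 0.07, and the use of cosine similarity are assumptions made for illustration and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_at(scores, index):
    """Return log-softmax of `scores` evaluated at position `index`."""
    peak = max(scores)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return scores[index] - log_sum


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Hypothetical symmetric InfoNCE-style loss.

    image_embs[i] and text_embs[i] are assumed to form a matched pair;
    every other combination in the batch is treated as a negative.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # sim[i][j]: temperature-scaled cosine similarity of image i and text j.
    sim = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each image should select its own caption.
    i2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-image direction: each caption should select its own image.
    t2i = -sum(
        log_softmax_at([sim[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (i2t + t2i)


aligned_loss = symmetric_contrastive_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
)
shuffled_loss = symmetric_contrastive_loss(
    [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]
)
```

In this toy batch the loss is close to zero when each image embedding matches its own caption embedding and much larger when the captions are swapped, which is the behavior a contrastive objective of this kind is meant to reward.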