Abstract: With the rapid growth of remote sensing data across different modalities, cross-modal retrieval tasks have gained importance in the research community. In cross-modal retrieval, the query belongs to one modality and the retrieved output to another. In this paper, the remote sensing (RS) modalities considered are earth observation optical data (aerial photos) and the corresponding hand-drawn sketches. The main challenge in cross-modal retrieval between optical RS images and their sketches is the distribution gap between the two modalities in the shared embedding space. Prior attempts to resolve this issue have not retrieved cross-modal sketch-image RS data with satisfactory accuracy. State-of-the-art approaches rely on conventional convolutional architectures, which focus on local pixel-wise information in the modalities to be retrieved. This limits the interaction between the sketch texture and the corresponding image and makes these models susceptible to overfitting to datasets with particular scenarios. To circumvent this limitation, we propose establishing multi-modal correspondence with SPCA-Net, a novel architecture that combines self-attention and cross-attention to minimize the modality gap by applying attention mechanisms to the query and the other modality. The proposed attention architecture achieves efficient cross-modal retrieval: it empirically emphasizes the global information of the relevant query modality and bridges the domain gap through a unique pairwise cross-attention network. In addition to the novel architecture, this paper introduces a unique loss function, the label-specific supervised contrastive loss, tailored to the intricacies of the task and designed to enhance the discriminative power of the learned embeddings.
Extensive evaluations are conducted on two sketch-image remote sensing datasets, Earth-on-Canvas and RSketch. Under the same experimental conditions, our proposed model outperforms the state-of-the-art architectures by significant margins of 16.7%, 18.9%, 33.7%, and 40.9%, respectively.
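
The abstract does not give the form of the label-specific supervised contrastive loss. As a rough illustration of the general idea only, the sketch below implements a generic cross-modal supervised contrastive loss in which each sketch embedding is an anchor and every image embedding with the same class label is a positive. The function name, the temperature value, and the sketch-to-image direction are assumptions made for illustration and should not be read as the SPCA-Net formulation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_modal_supcon_loss(sketch_emb, image_emb, sketch_labels,
                            image_labels, temperature=0.1):
    """Generic supervised contrastive loss across two modalities.

    Hypothetical helper, not the paper's loss: each sketch is an anchor,
    images sharing its label are positives, all other images are negatives.
    """
    sketches = [l2_normalize(v) for v in sketch_emb]
    images = [l2_normalize(v) for v in image_emb]

    total, n_anchors = 0.0, 0
    for anchor, anchor_label in zip(sketches, sketch_labels):
        # Temperature-scaled cosine similarity to every image embedding.
        logits = [sum(a * b for a, b in zip(anchor, img)) / temperature
                  for img in images]
        # Numerically stable log-sum-exp over all images.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Logits of images that share the anchor's class label.
        positives = [l for l, lab in zip(logits, image_labels)
                     if lab == anchor_label]
        if not positives:
            continue  # anchors without a positive do not contribute
        total += -sum(l - log_denom for l in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0


# Toy usage: two classes, 2-D embeddings.
sketches = [[1.0, 0.0], [0.0, 1.0]]
photos = [[1.0, 0.0], [0.0, 1.0]]
aligned = cross_modal_supcon_loss(sketches, photos, [0, 1], [0, 1])
misaligned = cross_modal_supcon_loss(sketches, photos, [0, 1], [1, 0])
```

In this toy case the loss is close to zero when same-class sketches and images point in the same direction and large when they are orthogonal, which is the behavior a loss of this family uses to pull the two modalities together in the shared embedding space.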
