Abstract: With the rapid growth of remote sensing data across modalities, cross-modal retrieval has gained importance in the research community. In cross-modal retrieval, the query belongs to one modality and the retrieved output to another. In this paper, the remote sensing (RS) modalities considered are earth observation optical data (aerial photos) and the corresponding hand-drawn sketches. The main challenge in cross-modal retrieval between optical RS images and their sketches is the distribution gap between the two modalities in the shared embedding space. Prior attempts to resolve this issue have not retrieved cross-modal sketch-image RS data with satisfactory accuracy. State-of-the-art approaches use conventional convolutional architectures, which focus on local pixel-wise information in the modalities to be retrieved. This limits the interaction between the sketch texture and the corresponding image and makes these models susceptible to overfitting datasets with particular scenarios. To circumvent this limitation, we propose establishing multi-modal correspondence with SPCA-Net, a novel architecture that combines self-attention and cross-attention to minimize the modality gap by applying attention mechanisms to the query and the other modality. The proposed attention architecture achieves efficient cross-modal retrieval: it empirically emphasizes the global information of the relevant query modality and bridges the domain gap through a pairwise cross-attention network. In addition to the architecture, this paper introduces a new loss function, a label-specific supervised contrastive loss, tailored to the intricacies of the task to enhance the discriminative power of the learned embeddings.
Extensive evaluations are conducted on two sketch-image remote sensing datasets, Earth-on-Canvas and RSketch. Under the same experimental conditions, our proposed model outperforms the state-of-the-art architectures across the reported performance metrics by significant margins of 16.7%, 18.9%, 33.7%, and 40.9%.
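
As a rough illustration of the kind of objective the abstract refers to, the sketch below implements a generic cross-modal supervised contrastive loss in NumPy: each sketch embedding is pulled toward image embeddings that share its class label and pushed away from the rest. This is a standard supervised contrastive formulation, not the paper's label-specific variant, whose details the abstract does not give. The function name, argument names, and the temperature value are assumptions made for illustration.

```python
import numpy as np


def cross_modal_supcon_loss(sketch_emb, image_emb, sketch_labels, image_labels,
                            temperature=0.1):
    """Generic supervised contrastive loss with sketches as anchors.

    Hypothetical helper for illustration only; it is not the loss
    defined in the paper.
    """
    sketch_emb = np.asarray(sketch_emb, dtype=float)
    image_emb = np.asarray(image_emb, dtype=float)
    sketch_labels = np.asarray(sketch_labels)
    image_labels = np.asarray(image_labels)

    # L2-normalise so that dot products are cosine similarities.
    s = sketch_emb / np.linalg.norm(sketch_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)

    # Sketch-to-image similarity matrix, scaled by the temperature.
    logits = s @ v.T / temperature

    # Numerically stable log-softmax over the image axis.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positives are images that share the anchor sketch's class label.
    pos = (sketch_labels[:, None] == image_labels[None, :]).astype(float)
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0
    if not valid.any():
        return 0.0  # no sketch has a same-class image in this batch

    # Average negative log-probability over each anchor's positives.
    per_anchor = -(pos * log_prob).sum(axis=1)[valid] / n_pos[valid]
    return float(per_anchor.mean())
```

With embeddings that cluster by class, the loss is small when the sketch and image labels agree and large when they are mismatched, which is the behaviour a retrieval embedding needs.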
