Abstract: With the rapid growth of remote sensing data across modalities, cross-modal retrieval has gained importance in the research community. In cross-modal retrieval, the query belongs to one modality and the retrieved output to another. In this paper, the remote sensing (RS) modalities considered are earth observation optical data (aerial photos) and the corresponding hand-drawn sketches. The main challenge in cross-modal retrieval between optical remote sensing images and their sketches is the distribution gap between the two modalities in the shared embedding space. Prior attempts to resolve this issue have not yielded satisfactory accuracy in retrieving cross-modal sketch-image RS data. State-of-the-art approaches rely on conventional convolutional architectures, which focus on local pixel-wise information about the modalities to be retrieved. This limits the interaction between the sketch texture and the corresponding image and makes these models prone to overfitting to datasets with particular scenarios. To circumvent this limitation, we propose establishing multi-modal correspondence with SPCA-Net, a novel architecture that combines self-attention and cross-attention to minimize the modality gap by applying attention mechanisms to the query and the other modality. The proposed attention architecture achieves efficient cross-modal retrieval: it empirically emphasizes the global information of the relevant query modality and bridges the domain gap through a unique pairwise cross-attention network. In addition to the novel architecture, this paper introduces a unique loss function, label-specific supervised contrastive loss, tailored to the intricacies of the task to enhance the discriminative power of the learned embeddings.
Extensive evaluations are conducted on two sketch-image remote sensing datasets, Earth-on-Canvas and RSketch. Under the same experimental conditions, our proposed model outperforms the state-of-the-art architectures on the performance metrics by significant margins of 16.7%, 18.9%, 33.7%, and 40.9%, respectively.
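
The two ingredients named in the abstract, cross-attention between modalities and a supervised contrastive objective, can be illustrated with a minimal sketch. This is not the SPCA-Net implementation: the abstract does not specify the architecture or the label-specific loss, so the code below assumes plain scaled dot-product attention and a generic supervised contrastive loss in which images sharing a sketch's class label act as positives. The function names (`cross_attention`, `cross_modal_supcon_loss`) and the `temperature` value are hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention (assumed form, not the paper's module).

    Each query token (e.g. from a sketch) attends over the key/value tokens
    of the other modality (e.g. an aerial image).
    """
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        attended.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return attended


def cross_modal_supcon_loss(sketch_emb, sketch_labels,
                            image_emb, image_labels, temperature=0.1):
    """Generic supervised contrastive loss from sketches to images.

    Assumption: images with the same class label as a sketch are its
    positives; all other images in the batch are negatives.
    """
    sketches = [l2_normalize(v) for v in sketch_emb]
    images = [l2_normalize(v) for v in image_emb]
    losses = []
    for sketch, label in zip(sketches, sketch_labels):
        logits = [dot(sketch, img) / temperature for img in images]
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        positives = [l for l, lab in zip(logits, image_labels) if lab == label]
        if not positives:
            continue  # no image of this class in the batch
        losses.append(-sum(p - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0
```

Under these assumptions, the loss is small when each sketch embedding lies closest to the image embeddings of its own class and grows when the classes are mismatched, which is the behaviour a discriminative shared embedding space requires.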

Boosting cross-modal retrieval in remote sensing via a novel unified attention network