Abstract: Cross-modal remote-sensing image–text retrieval (CMRSITR) is a challenging topic in the remote-sensing (RS) community. It has gained growing attention because it can be flexibly used in many practical applications. In the current deep learning era, with the help of deep convolutional neural networks (DCNNs), many successful CMRSITR methods have been proposed. Most of them first learn features from RS images and texts separately. Then, the obtained visual and textual features are mapped into a common space for the final retrieval. This pipeline is feasible; however, two difficulties remain. One is that the semantics within the visual and textual features are misaligned because the two modalities are learned independently. The other is that the deep links between RS images and texts cannot be fully explored by simple common-space mapping. To overcome these challenges, we propose a new model named interacting-enhancing feature transformer (IEFT) for CMRSITR, which regards the RS images and texts as a whole. First, a simple feature embedding module (FEM) is developed to map images and texts into the visual and textual feature spaces. Second, an information interacting-enhancing module (IIEM) is designed to simultaneously model the inner relationships between RS images and texts and enhance the visual features. IIEM consists of three feature interacting-enhancing (FIE) blocks, each of which contains an intermodality relationship interacting (IMRI) subblock and a visual feature enhancing (VFE) subblock. IMRI exploits the hidden relations between cross-modal data, while VFE improves the visual features. By combining them, semantic bias can be mitigated, and the complex contents of RS images can be studied. Finally, the retrieval module (RM) is constructed to generate the matching scores that decide the search results. Extensive experiments are conducted on four public RS datasets. The positive results demonstrate that our IEFT can achieve superior retrieval performance compared with many existing methods. Our source code is available at https://github.com/TangXu-Group/Cross-modal-remote-sensing-image-and-text-retrieval-models/tree/main/IEFT.
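
To make the data flow described in the abstract concrete, the following is a minimal structural sketch, not the authors' implementation. The abstract does not specify the internals of IMRI, VFE, or RM, so everything inside them is an assumption made for illustration: IMRI is stood in for by weight-free bidirectional cross-attention, VFE by weight-free self-attention over the visual features, and RM by cosine similarity of mean-pooled features. The FEM step is omitted; the inputs are assumed to be already-embedded feature rows of equal width. All function names (`attend`, `imri`, `vfe`, `iiem`, `matching_score`) are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]  # one row per image region or text token


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Matrix, context: Matrix) -> Matrix:
    """Residual scaled dot-product attention with no learned weights (assumption)."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, c)) / scale for c in context])
        mixed = [sum(w * c[i] for w, c in zip(weights, context)) for i in range(len(q))]
        out.append([a + m for a, m in zip(q, mixed)])  # residual connection
    return out


def imri(visual: Matrix, textual: Matrix) -> Tuple[Matrix, Matrix]:
    """Stand-in for the IMRI subblock: each modality attends to the other."""
    return attend(visual, textual), attend(textual, visual)


def vfe(visual: Matrix) -> Matrix:
    """Stand-in for the VFE subblock: visual features attend to themselves."""
    return attend(visual, visual)


def iiem(visual: Matrix, textual: Matrix, num_blocks: int = 3) -> Tuple[Matrix, Matrix]:
    """Three FIE blocks, each an IMRI step followed by a VFE step."""
    for _ in range(num_blocks):
        visual, textual = imri(visual, textual)
        visual = vfe(visual)
    return visual, textual


def mean_pool(rows: Matrix) -> Vector:
    """Average the rows into a single feature vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def matching_score(visual: Matrix, textual: Matrix) -> float:
    """Stand-in for the RM: score one image-text pair after joint interaction."""
    v, t = iiem(visual, textual)
    return cosine(mean_pool(v), mean_pool(t))
```

Under these assumptions, retrieval amounts to calling `matching_score` for a query against every candidate and sorting by score. Unlike the common-space baseline, the score depends on the pair jointly, so candidate features cannot be precomputed independently of the query.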

VIRT: Improving Representation-based Models for Text Matching through Virtual Interaction

Select & Re-Rank: Effectively and Efficiently Matching Multimodal Data with Dynamically Evolving Attention

A simple and efficient text matching model based on deep interaction

SAViT: Structure-Aware Vision Transformer Pruning Via Collaborative Optimization

A Multi-interaction Model with Cross-Branch Feature Fusion for Video-Text Retrieval

RI-Match: Integrating Both Representations and Interactions for Deep Semantic Matching

MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering

Enhanced Pre-Trained Transformer with Aligned Attention Map for Text Matching

VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers

Short text matching model with multiway semantic interaction based on multi-granularity semantic embedding

Hierarchical visual-semantic interaction for scene text recognition

Interacting-Enhancing Feature Transformer for Cross-modal Remote Sensing Image and Text Retrieval

VIRT: Vision Instructed Transformer for Robotic Manipulation

Improving Text Semantic Similarity Modeling through a 3D Siamese Network

Explicit Pairwise Word Interaction Modeling Improves Pretrained Transformers for English Semantic Similarity Tasks

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Cross-modality interaction reasoning for enhancing vision-language pre-training in image-text retrieval

Lightweight Text Matching Method with Rich Features

ViLEM: Visual-Language Error Modeling for Image-Text Retrieval

S2R-ViT for Multi-Agent Cooperative Perception: Bridging the Gap from Simulation to Reality