Abstract: Cross-modal image-text matching has attracted considerable interest in both the computer vision and natural language processing communities. The central problem in image-text matching is to learn compact cross-modal representations and the correlation between image and text representations. The task poses two major challenges. First, current image representation methods focus on semantic information and disregard the spatial position relations between image regions. Second, most existing methods pay little attention to improving the textual representation, which plays a significant role in image-text matching. To address these issues, we designed a decipherable cross-modal multi-relationship aware reasoning network (CMRN) for image-text matching. In particular, a new method is proposed to extract multiple relationships and to learn the correlations between image regions, covering two kinds of visual relations: the geometric position relation and the semantic interaction. In addition, images are processed as graphs, and a novel spatial relation encoder is introduced to perform reasoning on the graphs by employing a graph convolutional network (GCN) with an attention mechanism. Thereafter, a contextual text encoder based on Bidirectional Encoder Representations from Transformers (BERT) is adopted to learn distinctive textual representations. To verify the effectiveness of the proposed model, extensive experiments were conducted on two public datasets, MSCOCO and Flickr30K. The experimental results show that CMRN achieves superior performance compared with state-of-the-art methods. On Flickr30K, the proposed method outperforms state-of-the-art methods by more than 7.4% in text retrieval from an image query and by a relative 5.0% in image retrieval with a text query (based on Recall@1). On MSCOCO, the performance reaches 73.9% for text retrieval and 60.4% for image retrieval (based on Recall@1).
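The results above are reported as Recall@1. As a minimal sketch of how Recall@K is commonly computed for retrieval, the following counts the fraction of queries whose top-K ranked candidates contain at least one ground-truth match. The function name `recall_at_k`, the toy similarity scores, and the ground-truth sets are illustrative assumptions and are not taken from the paper or its evaluation code.

```python
from typing import Sequence, Set


def recall_at_k(similarity: Sequence[Sequence[float]],
                relevant: Sequence[Set[int]],
                k: int = 1) -> float:
    """Fraction of queries with at least one relevant item in the top-k results.

    similarity[i][j] is the score between query i and candidate j.
    relevant[i] is the set of candidate indices that match query i.
    """
    hits = 0
    for scores, gold in zip(similarity, relevant):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gold.intersection(ranked[:k]):
            hits += 1
    return hits / len(similarity)


# Toy example (hypothetical scores): 3 image queries against 4 captions.
sim = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.3, 0.2, 0.1, 0.7],
]
gold = [{0}, {1}, {3}]

print(recall_at_k(sim, gold, k=1))
print(recall_at_k(sim, gold, k=2))
```

In this toy case the second query ranks its matching caption second, so it counts as a miss at K=1 and a hit at K=2. The same function covers both retrieval directions if the similarity matrix is transposed and the ground-truth sets are swapped accordingly.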

Bottom-Up Transformer Reasoning Network for Text-Image Retrieval

Text to Point Cloud Localization with Relation-Enhanced Transformer

Reservoir Computing Transformer for Image-Text Retrieval

Cross-modal Information Balance-Aware Reasoning Network for Image-Text Retrieval

Cross-modal Multi-Relationship Aware Reasoning for Image-Text Matching

BiC-Net: Learning Efficient Spatio-Temporal Relation for Text-Video Retrieval

A Mutually Textual and Visual Refinement Network for Image-Text Matching

FTN-VQA: Multimodal Reasoning by Leveraging a Fully Transformer-Based Network for Visual Question Answering

Target-Oriented Transformation Networks for Document Retrieval

Relation Transformer Network

What's Wrong with the Bottom-up Methods in Arbitrary-shape Scene Text Detection

Context‐aware relation enhancement and similarity reasoning for image‐text retrieval

TrTr-CMR: Cross-Modal Reasoning Dual Transformer for Remote Sensing Image Captioning

Dual Position Relationship Transformer for Image Captioning

Multi-level network based on transformer encoder for fine-grained image–text matching

RRTrN: A Lightweight and Effective Backbone for Scene Text Recognition

Dual-Branch Network Based on Transformer for Texture Recognition

BENet: bi-directional enhanced network for image captioning

Bottom-Up Progressive Semantic Alignment for Image-Text Retrieval

Dual-Level Representation Enhancement on Characteristic and Context for Image-Text Retrieval

A transformer-based cross-modal image-text retrieval method using feature decoupling and reconstruction