Abstract: Image-text retrieval is usually performed with either a visual semantic embedding network or a cross-modal cross-attention network. Existing work has shown that the two achieve similar performance, but the former has lower computational complexity, so it retrieves faster and is of greater value in engineering applications. In this paper, we propose a Super Visual Semantic Embedding Network (SVSEN) for cross-modal image-text retrieval, which consists of two independent branches: an image embedding network and a text embedding network. In the image embedding network, a feature extraction network first extracts fine-grained features of the image. A graph attention module with a residual connection then enhances the image semantics. Finally, a Softmax pooling strategy maps the fine-grained image features to a common embedding space. In the text embedding network, we use the pre-trained BERT-base-uncased model, fine-tuned during training, to extract context-dependent word vectors, and then map these fine-grained word vectors to the common embedding space by max pooling. In the common embedding space, a soft-label-based triplet loss function is adopted for cross-modal semantic alignment learning. In experiments on two widely used datasets, MS-COCO and Flickr-30K, the proposed SVSEN achieves the best performance. For instance, on Flickr-30K, SVSEN achieves relative R@1 improvements of 3.91% on image retrieval and 1.96% on text retrieval.
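The pooling and loss steps named in the abstract can be illustrated with a minimal sketch. It is not the SVSEN implementation: the abstract does not give the exact Softmax pooling formula or the soft-label scheme, so the code below assumes a per-dimension softmax-weighted average for `softmax_pool` and shows only a plain hinge triplet loss on cosine similarity (no soft labels). All function names, the `temperature` parameter, and the `margin` value are hypothetical.

```python
import math


def softmax_pool(features, temperature=1.0):
    """Pool region vectors into one vector, one dimension at a time.

    Assumption: each output dimension is a softmax-weighted average of
    that dimension over all regions, so larger activations weigh more.
    """
    pooled = []
    for d in range(len(features[0])):
        column = [vec[d] for vec in features]
        peak = max(column)  # subtract the max for numerical stability
        weights = [math.exp((v - peak) / temperature) for v in column]
        total = sum(weights)
        pooled.append(sum(w * v for w, v in zip(weights, column)) / total)
    return pooled


def max_pool(features):
    """Keep the largest value of each dimension over all word vectors."""
    return [max(vec[d] for vec in features) for d in range(len(features[0]))]


def cosine(a, b):
    """Cosine similarity; a zero vector is treated as having norm 1."""
    norm_a = math.sqrt(sum(x * x for x in a)) or 1.0
    norm_b = math.sqrt(sum(y * y for y in b)) or 1.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def triplet_hinge(anchor, positive, negative, margin=0.2):
    """Plain hinge triplet loss (hard labels only, an illustrative stand-in)."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


# Two image regions and two word vectors, each with two dimensions.
image_embedding = softmax_pool([[1.0, 0.5], [3.0, 0.5]])
text_embedding = max_pool([[1.0, 5.0], [3.0, 2.0]])
```

Under this reading, Softmax pooling yields a value between the mean and the maximum of each dimension, while max pooling keeps only the maximum.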

Word2VisualVec: Cross-Media Retrieval by Visual Feature Prediction

Predicting Visual Features from Text for Image and Video Caption Retrieval

Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction

Picture It In Your Mind: Generating High Level Visual Representations From Textual Descriptions

Learning a Semantic Space by Deep Network for Cross-media Retrieval

Crossmedia retrieval by learning rich semantic embeddings of multimedia

Click-through-Based Word Embedding for Large Scale Image Retrieval

Stacked Convolutional Deep Encoding Network for Video-Text Retrieval

Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval

Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes

Modality-dependent Cross-media Retrieval

Deep Binaries: Encoding Semantic-Rich Cues for Efficient Textual-Visual Cross Retrieval

Exploiting visual word co-occurrence for image retrieval

Cross-Modality Matching Based On Fisher Vector With Neural Word Embeddings And Deep Image Features

Super Visual Semantic Embedding for Cross-Modal Image-Text Retrieval

Cross-Media Similarity Evaluation for Web Image Retrieval in the Wild

Multi-Modal Retrieval Via Deep Textual-Visual Correlation Learning

Visual Spatio-temporal Relation-enhanced Network for Cross-modal Text-Video Retrieval

Representing Word Image Using Visual Word Embeddings And Rnn For Keyword Spotting On Historical Document Images

Deep Visual Semantic Embedding with Text Data Augmentation and Word Embedding Initialization