Abstract: With the rapid growth of multimedia data on the Internet, demand for visual-textual cross-media retrieval between images and sentences has risen sharply. However, the heterogeneity of visual and textual data makes it very challenging to measure cross-media similarity for retrieval. Although existing methods have made great progress thanks to the strong learning ability of deep neural networks, they rely heavily on large-scale training data with manual annotation, that is, either pairwise image-sentence annotation or category annotation used as supervision for visual-textual correlation learning, both of which are extremely labor-intensive and time-consuming to collect. Without any pairwise or category annotation, it is highly challenging to construct correlations between images and sentences because of their inconsistent distributions and representations. People, however, naturally understand the correlation between visual and textual data at a high semantic level, and images and sentences that contain the same group of semantic concepts are easily matched in the human brain. Inspired by this cognitive process, this article proposes an unsupervised visual-textual correlation learning (UVCL) approach that constructs correlations without any manual annotation. The contributions are summarized as follows: 1) Unsupervised semantic-guided cross-media correlation mining is proposed to bridge the heterogeneous gap between visual and textual data. We measure the semantic matching degree between images and sentences, and generate descriptive sentences from the concepts extracted from images to further augment the training data in an unsupervised manner. The approach can therefore exploit the semantic knowledge within both visual and textual data to reduce the gap between them for further correlation learning. 2) Unsupervised visual-textual fine-grained semantic alignment is proposed to learn cross-media correlation by aligning fine-grained visual local patches and textual keywords with fine-grained soft attention as well as semantic-guided hard attention. The results effectively highlight the fine-grained semantic information within both images and sentences to boost visual-textual alignment. Extensive experiments on visual-textual cross-media retrieval in the unsupervised setting, without any manual annotation, are conducted on two widely used datasets, Flickr-30K and MS-COCO, and verify the effectiveness of the proposed UVCL approach.
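The idea of matching images and sentences through shared semantic concepts can be illustrated with a minimal sketch. The abstract does not define the semantic matching degree, so the Jaccard overlap used here is an assumption chosen for illustration, and the function names `concept_matching_degree` and `rank_sentences` are hypothetical. The sketch also assumes that a concept detector has already produced a list of concept words for the image.

```python
def concept_matching_degree(image_concepts, sentence_tokens):
    """Jaccard overlap between image concepts and sentence words.

    Assumption: this overlap is a stand-in for the semantic matching
    degree, whose exact definition is not given in the abstract.
    """
    concepts = {c.lower() for c in image_concepts}
    words = {w.lower() for w in sentence_tokens}
    if not concepts or not words:
        return 0.0
    # Shared concepts divided by all distinct concepts and words.
    return len(concepts & words) / len(concepts | words)


def rank_sentences(image_concepts, sentences):
    """Return (score, sentence) pairs sorted by descending score."""
    scored = [
        (concept_matching_degree(image_concepts, s.split()), s)
        for s in sentences
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical concepts extracted from one image.
    concepts = ["dog", "frisbee"]
    candidates = ["a man rides a horse", "a dog catches a frisbee"]
    for score, sentence in rank_sentences(concepts, candidates):
        print(f"{score:.2f}  {sentence}")
```

Under this assumption, the sentence that shares the most concepts with the image is ranked first, without any pairwise or category annotation. Plain set overlap ignores synonyms and word order, so it only approximates the high-level semantic matching the abstract describes.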

Learning Semantic Correlation of Web Images and Text with Mixture of Local Linear Mappings

Semantic Correlation Mining between Images and Texts with Global Semantics and Local Mapping

Kernel-Based Mixture Mapping for Image and Text Association

Semantic image classification using statistical local spatial relations model

Analyzing semantic correlation for cross-modal retrieval

Learning nonlinear manifolds based on mixtures of localized linear manifolds under a self-organizing framework

Local Linear Matrix Factorization for Document Modeling

Bidirectional-isomorphic Manifold Learning at Image Semantic Understanding & Representation

Image-Text Embedding Learning Via Visual and Textual Semantic Reasoning

Multilateral Semantic Relations Modeling for Image Text Retrieval

Semantic and Correlation Disentangled Graph Convolutions for Multilabel Image Recognition

Exploiting Multi-Context Analysis in Semantic Image Classification

Multi-granularity Correlation Refinement for Semantic Correspondence

Content-oriented Multimedia Document Understanding Through Cross-Media Correlation

Video Semantic Concept Detection Using Multi-Modality Subspace Correlation Propagation

Collaborative Similarity Metric Learning for Semantic Image Annotation and Retrieval

Exploring Entity-Level Spatial Relationships for Image-Text Matching

Improving Multi-label Learning with Missing Labels by Structured Semantic Correlations

Learning Descriptive Visual Representation by Semantic Regularized Matrix Factorization

Unsupervised Visual–Textual Correlation Learning with Fine-Grained Semantic Alignment

Web Image Semi-supervised Learning Method Based on Heterogeneous Information Fusion