Abstract: With the rapid growth of multimedia data on the Internet, demand for visual-textual cross-media retrieval between images and sentences has risen quickly. However, the heterogeneity of visual and textual data makes it difficult to measure cross-media similarity for retrieval. Although existing methods have made great progress thanks to the strong learning ability of deep neural networks, they rely heavily on large-scale manually annotated training data, that is, either pairwise image-sentence annotation or category annotation as supervision for visual-textual correlation learning, which is extremely labor-intensive and time-consuming to collect. Without any pairwise or category annotation, it is highly challenging to construct correlations between images and sentences because of their inconsistent distributions and representations. People, however, naturally understand the correlation between visual and textual data at the level of high-level semantics, and images and sentences that contain the same group of semantic concepts are easily matched in the human brain. Inspired by this cognitive process, this article proposes an unsupervised visual-textual correlation learning (UVCL) approach that constructs correlations without any manual annotation. The contributions are summarized as follows: 1) Unsupervised semantic-guided cross-media correlation mining is proposed to bridge the heterogeneous gap between visual and textual data. We measure the semantic matching degree between images and sentences, and generate descriptive sentences from the concepts extracted from images to further augment the training data in an unsupervised manner. The approach can therefore exploit the semantic knowledge within both visual and textual data to reduce the gap between them for further correlation learning. 2) Unsupervised visual-textual fine-grained semantic alignment is proposed to learn cross-media correlation by aligning fine-grained visual local patches and textual keywords using fine-grained soft attention as well as semantic-guided hard attention; the results effectively highlight the fine-grained semantic information within both images and sentences to boost visual-textual alignment. Extensive experiments on visual-textual cross-media retrieval in the unsupervised setting, without any manual annotation, on two widely used datasets, Flickr-30K and MS-COCO, verify the effectiveness of the proposed UVCL approach.
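
The two ideas named in the abstract, a semantic matching degree between an image and a sentence and soft attention between local patches and keywords, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the formulas, so the Jaccard overlap, the cosine similarity, the softmax temperature, and all function names below are assumptions chosen only for illustration. The semantic-guided hard attention is not sketched here.

```python
import math

def matching_degree(image_concepts, sentence_tokens):
    """Assumed measure: Jaccard overlap between the concepts detected
    in an image and the words of a sentence."""
    a, b = set(image_concepts), set(sentence_tokens)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0

def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention_similarity(patches, words, temperature=0.1):
    """Assumed alignment score: each word attends over all patches,
    the attended patch vector is compared with the word by cosine,
    and the per-word scores are averaged."""
    if not patches or not words:
        return 0.0
    dim = len(patches[0])
    sims = []
    for w in words:
        weights = softmax([cosine(w, p) for p in patches], temperature)
        attended = [sum(wt * p[i] for wt, p in zip(weights, patches))
                    for i in range(dim)]
        sims.append(cosine(w, attended))
    return sum(sims) / len(sims)
```

In a retrieval setting, a score of this kind would be computed for every image-sentence pair and used for ranking; in this sketch the patch and word vectors are assumed to already live in a shared embedding space.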

Cross-Modal Saliency Correlation for Image Annotation

Color boosted visual saliency detection and its application to image classification

Detect saliency to understand a photo

Cross-modal image sentiment analysis via deep correlation of textual semantic

Automatic image annotation based on salient regions

Web and personal image annotation by mining label correlation with relaxed visual graph embedding

Fine-Grained Image Classification Via Spatial Saliency Extraction

Correlative multi-label multi-instance image annotation

Visual-Verbal Consistency Of Image Saliency

Cross-Graph Attention Enhanced Multi-Modal Correlation Learning for Fine-Grained Image-Text Retrieval

CSA-Net: Deep Cross-Complementary Self Attention and Modality-Specific Preservation for Saliency Detection

Unsupervised Visual–Textual Correlation Learning with Fine-Grained Semantic Alignment

A Multiple Instance Learning Approach to Image Annotation with Saliency Map

Towards Sketch-Based Image Retrieval with Deep Cross-Modal Correlation Learning

AnANet: Modeling Association and Alignment for Cross-modal Correlation Classification

Towards Multi-Semantic Image Annotation with Graph Regularized Exclusive Group Lasso

Rethinking Crowdsourcing Annotation: Partial Annotation with Salient Labels for Multi-Label Image Classification

Visual Attention in Multi-Label Image Classification

Image Saliency Estimation Via Random Walk Guided by Informativeness and Latent Signal Correlations

Automatic image annotation via local multi-label classification

Cross-Modal Attention With Semantic Consistence for Image–Text Matching