Abstract: The amount of multimedia content in microblogs is growing significantly. In certain large systems, more than 20% of microblogs link to a picture or video. The rich semantics of microblogs provide an opportunity to endow images with higher-level semantics beyond object labels. However, this raises new challenges for understanding the association between multimodal content in multimedia-rich microblogs. Contrary to the fundamental assumptions of traditional annotation, tagging, and retrieval systems, pictures and words in multimedia-rich microblogs are only loosely associated, and a correspondence between them cannot be assumed. To address these challenges, we present the first study that analyzes and models the associations between multimodal content in microblog streams, aiming to discover multimodal topics by establishing correspondences between pictures and words. We first use a data-driven approach to analyze the new characteristics of the words, the pictures, and their association types in microblogs. We then propose a novel generative model, the Bilateral Correspondence Latent Dirichlet Allocation (BC-LDA) model. BC-LDA assigns flexible associations between pictures and words: it allows not only picture-word co-occurrence in both directions but also single-modality association. This flexibility lets the model fit the data distribution more closely, so that it can discover various types of joint topics and generate pictures and words from those topics accordingly. We evaluate the model extensively on a large-scale, real-world dataset of multimedia-rich microblogs and demonstrate its advantages in several application scenarios, including image tagging, text illustration, and topic discovery. The experimental results show that the proposed model significantly and consistently outperforms traditional approaches.

Matching words and pictures

A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

Part2Word: Learning Joint Embedding of Point Clouds and Text by Matching Parts to Words

Hierarchical Matching With Side Information For Image Classification

Automatic Image Annotation Based on Wordnet and Hierarchical Ensembles

Understanding, Categorizing and Predicting Semantic Image-Text Relations

Dual Semantic Relationship Attention Network for Image-Text Matching

Multilateral Semantic Relations Modeling for Image Text Retrieval

Cross-modal Semantically Augmented Network for Image-text Matching

Learn from your neighbor: Learning multi-modal mappings from sparse annotations

A Probabilistic Semantic Model for Image Annotation and Multi-Modal Image Retrieval

Parts2Words: Learning Joint Embedding of Point Clouds and Texts by Bidirectional Matching Between Parts and Words

Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

Multi-Scale Fine-Grained Alignments for Image and Sentence Matching

Bilateral Correspondence Model for Words-and-Pictures Association in Multimedia-Rich Microblogs

Multimodal Distributional Semantics

Multi-Modal Image Annotation with Multi-Instance Multi-Label LDA

Effective Multi-Modal Multi-Label Learning for Automatic Image Annotation

Multi-Modal Memory Enhancement Attention Network for Image-Text Matching

Area-keywords Cross-Modal Alignment for Referring Image Segmentation