Neural Visual Social Comment on Image-Text Content

Yue Yin, Hanzhou Wu, Xinpeng Zhang
DOI: https://doi.org/10.1080/02564602.2020.1730714
2020-01-01
Abstract: Social bots are software programs designed to produce content and interact with humans. As images become increasingly popular in social networks, understanding text alone is far from enough for social bots to stay active there; they also need visual awareness of image content. We introduce a novel task, Visual Social Comment (VSC), in which social bots generate relevant and informative comments on social content that contains both images and text. In this multimodal setting, our work focuses on two questions: how to extract and fuse visual and textual information to improve the quality of generated comments, and how to address the tendency of neural dialog models trained with the maximum likelihood estimation (MLE) criterion to generate generic responses. To fuse visual and textual context features closely through the relationship between them, we modify the standard sequence-to-sequence (Seq2Seq) framework with joint attention over the multimodal context. We also leverage topic information transferred from a topic classification model to build a perceptual loss function, which encourages the comment generation model to produce more informative and diverse comments whose topic corresponds to the context. Experimental results for models trained on data from Sina Weibo show that comments generated by our proposed models outperform those generated by baseline models in both relevance and informativeness.
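The abstract names two mechanisms without giving formulas: joint attention over the multimodal context and a topic-based perceptual loss added to MLE training. The following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes dot-product scoring, a single softmax shared across image-region and text-token features, and a weighted sum of the token-level MLE loss with a topic cross-entropy term. All function names, the `topic_weight` parameter, and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def joint_attention(decoder_state, visual_feats, text_feats):
    """Attend over visual and textual features with one shared softmax.

    Assumption: dot-product scoring, with both modalities already
    projected to the same dimension as the decoder state.
    """
    context_feats = visual_feats + text_feats
    weights = softmax([dot(decoder_state, f) for f in context_feats])
    dim = len(decoder_state)
    context = [
        sum(w * f[i] for w, f in zip(weights, context_feats))
        for i in range(dim)
    ]
    return context, weights


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target_index])


def total_loss(token_probs, target_tokens, topic_probs, context_topic,
               topic_weight=0.5):
    """MLE loss plus a topic term (hypothetical weighting scheme).

    token_probs: per-step vocabulary distributions from the decoder.
    topic_probs: topic distribution that a (frozen) topic classifier
                 assigns to the generated comment.
    context_topic: index of the topic predicted for the input context.
    """
    mle = sum(
        cross_entropy(p, t) for p, t in zip(token_probs, target_tokens)
    ) / len(target_tokens)
    topic = cross_entropy(topic_probs, context_topic)
    return mle + topic_weight * topic


# Toy usage: two image-region vectors and one text-token vector.
context, weights = joint_attention(
    decoder_state=[1.0, 0.0],
    visual_feats=[[1.0, 0.0], [0.0, 1.0]],
    text_feats=[[0.5, 0.5]],
)
loss = total_loss(
    token_probs=[[0.5, 0.5]],
    target_tokens=[0],
    topic_probs=[0.5, 0.5],
    context_topic=0,
)
```

Sharing one softmax across both modalities lets image regions and text tokens compete for attention at each decoding step, which is one way to realize the "close" fusion the abstract describes. The topic term penalizes comments whose predicted topic drifts from that of the context, which is the intended counterweight to generic responses.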