Image-Text Multimodal Emotion Classification via Multi-View Attentional Network
Xiaocui Yang, Shi Feng, Daling Wang, Yifei Zhang
DOI: https://doi.org/10.1109/tmm.2020.3035277
IF: 7.3
2021-01-01
IEEE Transactions on Multimedia
Abstract: Compared with single-modal content, multimodal data can express users' feelings and sentiments more vividly and interestingly. Therefore, multimodal sentiment analysis has become a popular research topic. However, most existing methods either learn the sentiment features of each modality independently, without considering their correlations, or simply integrate the multimodal features. In addition, most publicly available multimodal datasets are labeled with sentiment polarities, whereas the emotions expressed by users are more specific. Based on this observation, we build a large-scale image-text emotion dataset (i.e., labeled with different emotions), called TumEmo, with more than 190,000 instances from Tumblr. We further propose a novel multimodal emotion analysis model based on the Multi-view Attentional Network (MVAN), which utilizes a continually updated memory network to obtain the deep semantic features of image-text pairs. The model includes three stages: feature mapping, interactive learning, and feature fusion. In the feature mapping stage, we leverage image features from an object viewpoint and a scene viewpoint to capture effective information for multimodal emotion analysis. In the interactive learning stage, a mechanism based on the memory network extracts single-modal emotion features and interactively models the cross-view dependencies between the image and text. In the feature fusion stage, multiple features are deeply fused using a multilayer perceptron and a stacking-pooling module. The experimental results on the MVSA-Single, MVSA-Multiple, and TumEmo datasets show that the proposed MVAN outperforms strong baseline models by large margins.
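The interactive learning idea described in the abstract can be illustrated with a small sketch. The abstract does not specify the memory update rule or the layer shapes, so everything below is an assumption made for illustration: the function names (attend, memory_hops, fuse) are hypothetical, the update is a plain additive multi-hop attention read, and simple concatenation stands in for the paper's multilayer perceptron and stacking-pooling fusion. It is not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, memory):
    """Attention read: weighted sum of memory slots, weights from query-slot dot products."""
    weights = softmax([dot(query, slot) for slot in memory])
    dim = len(memory[0])
    return [sum(w * slot[i] for w, slot in zip(weights, memory)) for i in range(dim)]


def memory_hops(query, memory, hops=2):
    """Assumed update rule: after each hop, add the attention read to the query."""
    for _ in range(hops):
        read = attend(query, memory)
        query = [q + r for q, r in zip(query, read)]
    return query


def fuse(text_vec, object_feats, scene_feats, hops=2):
    """Let the text vector attend over each image view, then concatenate.

    Concatenation is a stand-in here; the paper fuses features with an MLP
    and a stacking-pooling module, which this sketch does not reproduce.
    """
    text_obj = memory_hops(text_vec, object_feats, hops)
    text_scene = memory_hops(text_vec, scene_feats, hops)
    return text_obj + text_scene


if __name__ == "__main__":
    text = [0.5, -0.2]                       # toy text representation
    objects = [[1.0, 0.0], [0.0, 1.0]]       # toy object-view image features
    scenes = [[0.3, 0.3]]                    # toy scene-view image features
    print(fuse(text, objects, scenes))
```

In this sketch the text vector acts as the query and each image view acts as the memory; a symmetric pass with image features querying text tokens would follow the same pattern.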
computer science, information systems, telecommunications, software engineering