Benchmarking Multimodal Sentiment Analysis

Erik Cambria, Devamanyu Hazarika, Soujanya Poria, Amir Hussain, R.B.V. Subramaanyam
DOI: https://doi.org/10.48550/arXiv.1707.09538
2017-07-30
Abstract: We propose a framework for multimodal sentiment analysis and emotion recognition using convolutional neural network-based feature extraction from text and visual modalities. We obtain a performance improvement of 10% over the state of the art by combining visual, text and audio features. We also discuss some major issues frequently ignored in multimodal sentiment analysis research: the role of speaker-independent models, importance of the modalities and generalizability. The paper thus serves as a new benchmark for further research in multimodal sentiment analysis and also demonstrates the different facets of analysis to be considered while performing such tasks.
Multimedia, Computation and Language
What problem does this paper attempt to address?
This paper asks how to combine text, visual and audio features effectively in multimodal sentiment analysis and emotion recognition so as to improve recognition accuracy. Specifically, it focuses on four aspects:

1. **Multimodal feature extraction and fusion**: The paper proposes a framework that uses convolutional neural networks (CNNs) to extract features from the text and visual modalities and combines them with audio features. The authors report a 10% performance improvement over the state of the art.
2. **The role of speaker-independent models**: Earlier multimodal sentiment analysis research often included the same speakers in both the training and test sets, which can lead the model to overfit to individual speakers. The paper therefore evaluates speaker-independent models experimentally, since their performance is what determines how well a system generalizes in practical applications.
3. **The importance of different modalities**: The paper analyzes the relative contributions of the text, visual and audio modalities. The authors find that the text modality usually performs better than the visual and audio modalities, but that in some cases the visual and audio modalities provide important supplementary information.
4. **The generalization ability of the model**: The authors train the model on one dataset and test it on another to measure cross-dataset performance. The results show that performance differs significantly across datasets, which the summary attributes to language and cultural differences.
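Points 1 and 2 above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The summary does not say how the features are fused, so concatenation is an assumption here, and the names `fuse_features`, `speaker_independent_split`, the `speaker` key and the toy data are all hypothetical. The sketch shows only the core idea of a speaker-independent protocol: utterances are assigned to the training or test set by speaker, so no speaker appears in both.

```python
import random
from collections import defaultdict


def fuse_features(text_vec, audio_vec, visual_vec):
    """Feature-level fusion by concatenation (an assumed, simple strategy)."""
    return list(text_vec) + list(audio_vec) + list(visual_vec)


def speaker_independent_split(utterances, test_fraction=0.2, seed=0):
    """Split utterances so that no speaker occurs in both train and test.

    `utterances` is a list of dicts, each with a hypothetical "speaker" key.
    """
    # Group utterances by the speaker who produced them.
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker"]].append(utt)

    speakers = sorted(by_speaker)
    if len(speakers) < 2:
        raise ValueError("need at least two speakers to split by speaker")
    random.Random(seed).shuffle(speakers)

    # Hold out whole speakers, keeping at least one on each side.
    n_test = round(len(speakers) * test_fraction)
    n_test = min(len(speakers) - 1, max(1, n_test))
    test_speakers = speakers[:n_test]
    train_speakers = speakers[n_test:]

    train = [u for s in train_speakers for u in by_speaker[s]]
    test = [u for s in test_speakers for u in by_speaker[s]]
    return train, test


# Toy data: 5 speakers with 3 utterances each; the feature values are made up.
utterances = [
    {"speaker": f"spk{s}", "features": fuse_features([0.1 * s], [0.2], [0.3])}
    for s in range(5)
    for _ in range(3)
]
train, test = speaker_independent_split(utterances, test_fraction=0.2)
fused = fuse_features([0.1, 0.2], [0.3], [0.4, 0.5])
```

Splitting by speaker rather than by utterance is the design choice that matters: a random utterance-level split would let the classifier see each test speaker during training, which is the leakage point 2 describes.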
In summary, this paper proposes a new multimodal sentiment analysis framework and uses it to address three issues that existing methods tend to overlook: speaker dependence, the relative importance of each modality, and the generalization ability of the model. Beyond improving accuracy, the study offers a benchmark and a direction for future research.