Abstract: We propose a framework for multimodal sentiment analysis and emotion recognition using convolutional neural network-based feature extraction from text and visual modalities. We obtain a performance improvement of 10% over the state of the art by combining visual, text and audio features. We also discuss some major issues frequently ignored in multimodal sentiment analysis research: the role of speaker-independent models, the importance of the modalities, and generalizability. The paper thus serves as a new benchmark for further research in multimodal sentiment analysis and also demonstrates the different facets of analysis to be considered while performing such tasks.

What problem does this paper attempt to address?

This paper addresses how to combine text, visual and audio features effectively in multimodal sentiment analysis and emotion recognition so that both tasks become more accurate. Specifically, it focuses on the following aspects:

1. **Multimodal feature extraction and fusion**: The paper proposes a framework that uses a Convolutional Neural Network (CNN) to extract features from the text and visual modalities and combines them with audio features. The authors report a 10% performance improvement over the state of the art with this approach.
2. **The role of speaker-independent models**: Multimodal sentiment analysis research often includes the same speakers in both the training and test sets, which can lead to overfitting to individual speakers. The paper therefore evaluates speaker-independent models experimentally, since their performance determines how well a system generalizes in practical applications.
3. **The importance of different modalities**: The paper analyzes the relative contributions of the text, visual and audio modalities. The experiments show that the text modality usually performs better than the visual and audio modalities, but that in some cases the visual and audio modalities provide important supplementary information.
4. **The generalization ability of the model**: The authors train the model on one dataset and test it on another to evaluate cross-dataset performance. The results show significant differences in performance across datasets, which the authors attribute to language and cultural differences.
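The fusion and modality-importance points above can be illustrated with a minimal sketch. It assumes feature-level fusion by simple concatenation, which is one common choice and is not confirmed by this summary as the paper's method. The names `fuse`, `modality_subsets` and the toy feature vectors are hypothetical and used only for illustration.

```python
from itertools import combinations

# Assumed modality names; the summary only says text, visual and audio are combined.
MODALITIES = ("text", "visual", "audio")


def fuse(sample, modalities=MODALITIES):
    """Concatenate the feature vectors of the selected modalities.

    `sample` maps a modality name to a list of floats. Concatenation is an
    assumption made for this sketch, not the paper's confirmed fusion method.
    """
    fused = []
    for name in modalities:
        fused.extend(sample[name])
    return fused


def modality_subsets(modalities=MODALITIES):
    """Enumerate every non-empty combination of modalities for an ablation study."""
    return [
        subset
        for size in range(1, len(modalities) + 1)
        for subset in combinations(modalities, size)
    ]


# Toy utterance with made-up feature values.
sample = {"text": [0.1, 0.2], "visual": [0.3], "audio": [0.5, 0.6]}

# Build one fused vector per modality combination; each could then be fed
# to the same classifier to compare unimodal, bimodal and trimodal setups.
fused_by_subset = {subset: fuse(sample, subset) for subset in modality_subsets()}
```

Training the same classifier on each entry of `fused_by_subset` gives a simple way to compare how much each modality contributes.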
In summary, the paper proposes a new multimodal sentiment analysis framework to address shortcomings of existing methods: speaker dependence, the lack of evaluation of modality importance, and limited generalization across datasets. The study improves the accuracy of sentiment analysis and also points to new directions for future research.
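A speaker-independent evaluation requires that no speaker contributes utterances to both the training and the test set. The following is a minimal sketch of such a split, assuming each utterance is a `(speaker_id, features, label)` tuple. The function name, the tuple layout and the toy data are assumptions for illustration and are not taken from the paper.

```python
import random
from collections import defaultdict


def speaker_independent_split(utterances, test_fraction=0.2, seed=0):
    """Split utterances so that train and test share no speakers.

    `utterances` is a list of (speaker_id, features, label) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # Group utterances by speaker so whole speakers are assigned to one side.
    by_speaker = defaultdict(list)
    for utterance in utterances:
        by_speaker[utterance[0]].append(utterance)

    # Shuffle speakers deterministically, then hold out a fraction of them.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    train = [u for s in speakers if s not in test_speakers for u in by_speaker[s]]
    test = [u for s in speakers if s in test_speakers for u in by_speaker[s]]
    return train, test


# Toy data: 5 speakers with 3 utterances each, using made-up features and labels.
data = [
    (f"spk{i}", [float(i), float(j)], (i + j) % 2)
    for i in range(5)
    for j in range(3)
]
train_set, test_set = speaker_independent_split(data, test_fraction=0.2)
```

Splitting by speaker rather than by utterance is what keeps a model from scoring well merely by recognizing a speaker it has already seen during training.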

Benchmarking Multimodal Sentiment Analysis

Attention-based multimodal sentiment analysis and emotion recognition using deep neural networks

Exploring Multimodal Sentiment Analysis via CBAM Attention and Double-layer BiLSTM Architecture

Multimodal Sentiment Intensity Analysis in Videos: Facial Gestures and Verbal Messages

Multimodal Emotional Classification Based on Meaningful Learning

Investigating Multisensory Integration in Emotion Recognition Through Bio-Inspired Computational Models

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

A customizable framework for multimodal emotion recognition using ensemble of deep neural network models

Integrative Sentiment Analysis: Leveraging Audio, Visual, and Textual Data

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

Multimodal Sentiment Analysis: A Survey of Methods, Trends, and Challenges

A Novel Context-Aware Multimodal Framework for Persian Sentiment Analysis

A soft voting ensemble learning-based approach for multimodal sentiment analysis

A comprehensive survey on deep learning-based approaches for multimodal sentiment analysis

Multimodal sentiment analysis leveraging the strength of deep neural networks enhanced by the XGBoost classifier

Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis

An efficient multimodal sentiment analysis in social media using hybrid optimal multi-scale residual attention network

Multimodal Video Sentiment Analysis Using Deep Learning Approaches, a Survey

Multimodal Sentiment Analysis: A Survey

Multimodal Sentiment Recognition With Multi-Task Learning