Tag-assisted Multimodal Sentiment Analysis under Uncertain Missing Modalities

Jiandian Zeng, Tianyi Liu, Jiantao Zhou
DOI: https://doi.org/10.48550/arXiv.2204.13707
2022-04-28
Abstract: Multimodal sentiment analysis has been studied under the assumption that all modalities are available. However, such a strong assumption does not always hold in practice, and most multimodal fusion models may fail when some modalities are missing. Several works have addressed the missing modality problem, but most of them considered only the single modality missing case and ignored the practically more general cases of multiple modalities missing. To this end, in this paper, we propose a Tag-Assisted Transformer Encoder (TATE) network to handle the problem of uncertain missing modalities. Specifically, we design a tag encoding module to cover both the single modality and multiple modalities missing cases, so as to guide the network's attention to those missing modalities. Besides, we adopt a new space projection pattern to align common vectors. Then, a Transformer encoder-decoder network is utilized to learn the missing modality features. Finally, the outputs of the Transformer encoder are used for the final sentiment classification. Extensive experiments are conducted on the CMU-MOSI and IEMOCAP datasets, showing that our method can achieve significant improvements compared with several baselines.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the problem of missing modalities in multimodal sentiment analysis. Most existing multimodal fusion models assume that all modalities are available during both training and testing, but in practice some modalities are often missing. For example, visual features may be lost because of insufficient camera coverage, acoustic information may be unavailable because of excessive environmental noise, and text may be missing because of privacy concerns. How to handle missing modalities in multimodal data has therefore become an active research topic.

The paper points out two challenges. First, although some previous works have addressed the case where a single modality is missing, they usually ignore the more general case where multiple modalities are missing simultaneously, so a new model has to be trained separately for each missing-modality case, which is both time-consuming and inconvenient. Second, in practical applications the pattern of missing modalities may be uncertain: for example, one or two modalities may be missing at random.

To address these problems, the paper proposes a Tag-Assisted Transformer Encoder (TATE) network that learns complementary features between modalities. For the first challenge, a tag encoding module marks the missing modalities, guiding the network to focus on them. For the second challenge, a Transformer is first used as an extractor to capture the features within each modality, and these features are then mapped to a common space through a pairwise projection pattern. After that, a pre-trained full-modality network supervises the encoded vectors. Finally, the output of the Transformer encoder is sent to a classifier for sentiment prediction.
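The idea of tagging missing modalities can be illustrated with a minimal Python sketch. This is not the paper's implementation: the three-element binary tag, the modality order in `MODALITIES`, the zero-filling of missing features, and the helper names `encode_tag` and `drop_modalities` are all assumptions made for illustration. The sketch only shows how one or two modalities might be dropped at random and how a tag could record which ones are gone.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Assumed modality order; the paper's actual tag layout may differ.
MODALITIES: Tuple[str, ...] = ("text", "audio", "visual")


def encode_tag(missing: Sequence[str]) -> List[int]:
    """Return a binary tag: 1 marks a missing modality, 0 an available one."""
    unknown = set(missing) - set(MODALITIES)
    if unknown:
        raise ValueError(f"unknown modalities: {sorted(unknown)}")
    return [1 if m in missing else 0 for m in MODALITIES]


def drop_modalities(
    sample: Dict[str, List[float]],
    rng: random.Random,
    max_missing: int = 2,
) -> Tuple[Dict[str, List[float]], List[int]]:
    """Simulate an uncertain missing pattern on one sample.

    Between 0 and max_missing modalities are chosen at random and their
    features are replaced with zeros (an assumption; other placeholders
    are possible). The tag describing the pattern is returned as well.
    """
    k = rng.randint(0, max_missing)
    missing = rng.sample(MODALITIES, k)
    masked = {
        name: [0.0] * len(feats) if name in missing else list(feats)
        for name, feats in sample.items()
    }
    return masked, encode_tag(missing)


if __name__ == "__main__":
    # Toy feature vectors standing in for real extracted features.
    example = {"text": [0.2, 0.7], "audio": [0.5, 0.1], "visual": [0.9, 0.4]}
    masked, tag = drop_modalities(example, random.Random(0))
    print(masked, tag)
```

In a setup like this, the tag would be passed to the network together with the masked features, so that a single model can be trained across all missing patterns instead of one model per pattern.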
The main contributions of the paper include:
- Proposing the TATE network to handle the case of multiple missing modalities in multimodal sentiment analysis, and making the code public.
- Designing a tag encoding module that covers both the single missing modality and multiple missing modalities cases, and adopting a new common-space projection module to learn the joint representation.
- Showing that, on the CMU-MOSI and IEMOCAP datasets, the proposed TATE model achieves significant improvements over several baseline models, verifying its effectiveness.