Analyzing Multimodal Sentiment Via Acoustic- and Visual-LSTM with Channel-Aware Temporal Convolution Network

Sijie Mai, Songlong Xing, Haifeng Hu
DOI: https://doi.org/10.1109/taslp.2021.3068598
2021-01-01
IEEE/ACM Transactions on Audio Speech and Language Processing
Abstract: Human emotion is always expressed in a multimodal manner. Analyzing multimodal human sentiment remains challenging because inter-modality dynamics are difficult to interpret. Mainstream multimodal learning architectures tend to design various fusion strategies to learn inter-modality interactions, but they barely consider the fact that the language modality is far more important than the acoustic and visual modalities. In contrast, we learn inter-modality dynamics from a different perspective via acoustic- and visual-LSTMs in which language features play the dominant role. Specifically, inside each LSTM variant, a well-designed gating mechanism is introduced to enhance the language representation via the corresponding auxiliary modality. Furthermore, in the unimodal representation learning stage, instead of using RNNs, we introduce a 'channel-aware' temporal convolution network to extract high-level representations for each modality and to explore both temporal and channel-wise interdependencies. Extensive experiments demonstrate that our approach achieves very competitive performance compared to state-of-the-art methods on three widely used benchmarks for multimodal sentiment analysis and emotion recognition.