Learning Modality-Fused Representation Based on Transformer for Emotion Analysis

Piao Shi, Min Hu, Fuji Ren, Xuefeng Shi, Liangfeng Xu
DOI: https://doi.org/10.1117/1.jei.31.6.063032
IF: 0.829
2022-01-01
Journal of Electronic Imaging
Abstract: Learning a modality-fused representation is an essential and challenging task in multimodal emotion analysis. Previous studies have yielded remarkable achievements, but two problems remain: insufficient feature interaction and coarse data fusion. To address these two problems, first, a hybrid architecture consisting of convolution and a transformer is proposed to extract local and global features. Second, to extract richer mutual features from multimodal data, our model comprises three parts: (1) the interior transformer encoder (TE), which extracts intramodality characteristics from a single modality; (2) the between TE, which extracts intermodality features between two different modalities; and (3) the enhance TE, which extracts enhanced features for the target modality from multiple modalities. Finally, instead of directly fusing features with a linear function, we employ the widely used multimodal factorized high-order pooling mechanism to obtain a more discriminative feature representation. Extensive experiments on three multimodal sentiment datasets (CMU-MOSEI, CMU-MOSI, and IEMOCAP) demonstrate that our approach reaches state-of-the-art performance in the unaligned setting. Compared with mainstream methods, our proposed method shows superiority in both word-aligned and unaligned settings. © 2022 SPIE and IS&T
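The abstract names multimodal factorized high-order pooling as the fusion step but gives no implementation details. The following is a minimal NumPy sketch of the general idea as it is commonly described: each modality is projected into a shared expanded space, the projections are multiplied elementwise, the result is sum-pooled in windows of size k, and several such blocks are cascaded and concatenated. It is not the authors' implementation. All function names, dimensions, and the random weights are hypothetical, and dropout and learned parameters are omitted.

```python
import numpy as np


def mfb_expand(x, y, U, V):
    """Project both modalities to a shared (k * o)-dim space and multiply elementwise."""
    return (x @ U) * (y @ V)


def pool_and_normalize(z_exp, k):
    """Sum-pool each window of k consecutive entries, then power- and L2-normalize."""
    z = z_exp.reshape(-1, k).sum(axis=1)       # (k * o,) -> (o,)
    z = np.sign(z) * np.sqrt(np.abs(z))        # signed square-root (power) normalization
    norm = np.linalg.norm(z)
    return z / norm if norm > 0 else z


def mfh_fuse(x, y, blocks, k):
    """Cascade factorized blocks: each expand stage is multiplied by the previous one."""
    prev = np.ones(blocks[0][0].shape[1])      # neutral element for the first block
    outputs = []
    for U, V in blocks:
        prev = prev * mfb_expand(x, y, U, V)   # higher-order interaction across blocks
        outputs.append(pool_and_normalize(prev, k))
    return np.concatenate(outputs)             # (p * o,) fused representation


# Hypothetical sizes: two modality vectors, factor k, output size o, p cascaded blocks.
rng = np.random.default_rng(0)
d_x, d_y, k, o, p = 8, 6, 3, 4, 2
x = rng.standard_normal(d_x)                   # stand-in for one modality's feature
y = rng.standard_normal(d_y)                   # stand-in for another modality's feature
blocks = [
    (rng.standard_normal((d_x, k * o)), rng.standard_normal((d_y, k * o)))
    for _ in range(p)
]
fused = mfh_fuse(x, y, blocks, k)
print(fused.shape)
```

With p = 1 the sketch reduces to a single factorized bilinear block. Cascading more blocks adds higher-order interaction terms, which is where this kind of fusion goes beyond the linear function the abstract contrasts it with.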