Multimodal emotion recognition from facial expression and speech based on feature fusion
Guichen Tang, Yue Xie, Ke Li, Ruiyu Liang, Li Zhao
DOI: https://doi.org/10.1007/s11042-022-14185-0
IF: 2.577
2022-11-12
Multimedia Tools and Applications
Abstract: Multimodal emotion recognition uses facial expression and speech information to identify individual behaviors. Feature fusion enriches the information drawn from the different modalities and is an important method for multimodal emotion recognition. However, fusion raises two problems: synchronizing information across modalities, and overfitting caused by large feature dimensions. An attention mechanism is therefore introduced so that the network automatically attends to locally effective information. It is used in the network both for fusing audio and video features and for temporal modeling. The main contributions are as follows: 1) a multi-head self-attention mechanism is used to fuse features from audio and video data, which avoids the influence of prior information on the fusion results, and 2) a bidirectional gated recurrent unit is used to model the time series of fused features; furthermore, the autocorrelation coefficient in the time dimension is calculated and used as attention for fusion. Experimental results show that the adopted attention mechanism effectively improves the accuracy of multimodal emotion recognition.
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering
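
The attention-based fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a single attention head with no learned projections (the paper uses multi-head self-attention), it assumes the audio and video frame features have already been mapped to the same dimension, and the function names and toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_fuse(audio_frames, video_frames):
    """Fuse two feature sequences with scaled dot-product self-attention.

    Simplification (assumption): one head, queries = keys = values = the
    raw features. Returns (fused, weights), one row per input frame.
    """
    # Treat every audio and video frame as one token in a joint sequence.
    tokens = audio_frames + video_frames
    dim = len(tokens[0])
    scale = math.sqrt(dim)

    fused, weights = [], []
    for query in tokens:
        # Similarity of this frame to every frame of both modalities.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Attention-weighted sum of all frames becomes the fused feature.
        fused.append([sum(w * tok[d] for w, tok in zip(row, tokens))
                      for d in range(dim)])
    return fused, weights


# Toy features (made up): 2 audio frames and 3 video frames, dimension 4.
audio = [[0.2, 0.1, 0.0, 0.5], [0.3, 0.0, 0.1, 0.4]]
video = [[0.0, 0.9, 0.2, 0.1], [0.1, 0.8, 0.3, 0.0], [0.0, 0.7, 0.2, 0.2]]
fused, weights = self_attention_fuse(audio, video)
```

Because the weights are computed from the features themselves, no hand-set weighting between modalities is needed, which is one way to read the abstract's remark about avoiding prior information. In the pipeline the abstract describes, the fused sequence would then be passed to a bidirectional gated recurrent unit for temporal modeling.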