Multi-Modal Emotion Recognition by Fusing Correlation Features of Speech-Visual

Chen Guanghui, Zeng Xiaoping
DOI: https://doi.org/10.1109/lsp.2021.3055755
2021-01-01
IEEE Signal Processing Letters
Abstract: To effectively fuse speech and visual features, this letter proposes a multi-modal emotion recognition method that fuses speech-visual correlation features. First, speech and visual features are extracted by a two-dimensional convolutional neural network (2D-CNN) and a three-dimensional convolutional neural network (3D-CNN), respectively. Second, the speech and visual features are processed by a feature correlation analysis algorithm during multi-modal fusion. In addition, the class information of the speech and visual features is applied to the feature correlation analysis algorithm, which fuses the two feature sets effectively and improves the performance of multi-modal emotion recognition. Finally, a support vector machine (SVM) performs the classification for multi-modal speech-visual emotion recognition. Experimental results on the RML, eNTERFACE05, and BAUM-1s datasets show that the recognition rate of the proposed method is higher than that of other state-of-the-art methods.
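The abstract does not spell out the feature correlation analysis algorithm, so the following is only a minimal sketch of the fusion step under stated assumptions. It uses plain canonical correlation analysis (CCA) as a stand-in and does not reproduce the class information that the letter adds. The function name `cca_fuse`, the regularization parameter `reg`, and the synthetic data are hypothetical; random matrices stand in for the 2D-CNN speech features and 3D-CNN visual features, and the SVM stage is omitted.

```python
import numpy as np


def cca_fuse(X, Y, k, reg=1e-3):
    """Project two feature sets onto their top-k canonical directions
    and concatenate the projections (plain CCA, no class labels)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]

    # Regularized covariance and cross-covariance estimates.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx = inv_sqrt(Sxx)
    Wy = inv_sqrt(Syy)

    # Singular values of the whitened cross-covariance are the
    # (regularized) canonical correlations.
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy)
    A = Wx @ U[:, :k]      # projection for the first modality
    B = Wy @ Vt.T[:, :k]   # projection for the second modality

    fused = np.hstack([Xc @ A, Yc @ B])
    return fused, s[:k]


# Synthetic stand-ins for CNN features: both modalities share a 2-D latent signal.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
speech = latent @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(200, 8))
visual = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))

fused, corrs = cca_fuse(speech, visual, k=2)
print(fused.shape, corrs)
```

In a full pipeline of the kind the abstract describes, the fused matrix would be passed to a classifier such as an SVM; a class-aware variant would additionally use the emotion labels when estimating the projections.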