A Lightweight Multi-modal Emotion Recognition Network Based on Multi-task Learning

Peisong Liu, Xiaoping Wang
DOI: https://doi.org/10.1109/icnc52316.2021.9608488
2021-10-15
Abstract: Human emotion recognition is an important part of human-computer interaction, and its wide range of application scenarios has drawn growing attention in recent years. This paper proposes a lightweight multimodal emotion recognition network that keeps the model as small as possible while preserving accuracy, so that human emotion recognition can be applied on mobile devices. The network takes three modalities as input: audio, video, and text. The audio signal is converted into MFCC features and the video signal is passed through MobileNet for feature extraction, thereby reducing the number of network parameters. For text data, BERT is used for feature extraction. The features extracted from the three modalities are combined through an attention mechanism. Finally, to improve the recognition rate and generalization ability of the network, a multi-task structure is introduced. The experimental results show that the lightweight model effectively reduces the number of network parameters and greatly lowers the hardware requirements, making it feasible to apply emotion recognition on mobile terminals.
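The fusion and multi-task steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the attention form, the feature dimensions, or the auxiliary task, so the scaled dot-product scoring, the names `attention_fuse`, `DIM`, `NUM_EMOTIONS`, and `NUM_AUX`, and the random vectors standing in for MFCC, MobileNet, and BERT features are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, bias, x):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def attention_fuse(features, query):
    """Fuse same-length modality vectors with scaled dot-product attention.

    Assumed form: score each modality against a query vector, softmax the
    scores, and return the weighted sum of the modality vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale
              for feat in features]
    weights = softmax(scores)
    fused = [sum(w * feat[i] for w, feat in zip(weights, features))
             for i in range(len(query))]
    return fused, weights


rng = random.Random(0)
DIM = 8            # hypothetical shared feature size after projection
NUM_EMOTIONS = 4   # hypothetical number of emotion classes
NUM_AUX = 2        # hypothetical auxiliary task (not named in the abstract)


def rand_vec(n):
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def rand_mat(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Random stand-ins for the per-modality features; a real system would
# compute these from MFCCs (audio), MobileNet (video), and BERT (text).
audio_feat, video_feat, text_feat = rand_vec(DIM), rand_vec(DIM), rand_vec(DIM)
query = rand_vec(DIM)  # would be a learned parameter in practice

fused, modality_weights = attention_fuse(
    [audio_feat, video_feat, text_feat], query)

# Multi-task structure: two heads share the same fused representation.
emotion_probs = softmax(linear(rand_mat(NUM_EMOTIONS, DIM),
                               [0.0] * NUM_EMOTIONS, fused))
aux_probs = softmax(linear(rand_mat(NUM_AUX, DIM), [0.0] * NUM_AUX, fused))

print("modality weights:", [round(w, 3) for w in modality_weights])
print("emotion probabilities:", [round(p, 3) for p in emotion_probs])
```

In this sketch the heads share one fused vector, so a training loss would typically be a weighted sum of the per-task losses; how the paper weights its tasks is not stated in the abstract.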