Early Joint Learning of Emotion Information Makes MultiModal Model Understand You Better
Mengying Ge,Mingyang Li,Dongkai Tang,Pengbo Li,Kuo Liu,Shuhao Deng,Songbai Pu,Long Liu,Yang Song,Tao Zhang
2024-09-12
Abstract: In this paper, we present our solutions for emotion recognition in the sub-challenges of the Multimodal Emotion Recognition Challenge (MER2024). To mitigate modality competition between audio and text, we adopt an early fusion strategy based on a large language model, in which audio and text are first trained jointly. The resulting joint audio-text feature is then late-fused with other unimodal features. To address data insufficiency and class imbalance, we use multiple rounds of multi-model voting for data mining. Moreover, to enhance the quality of audio features, we apply speech source separation as a preprocessing step for the audio. Our model ranks 2nd in both MER2024-SEMI and MER2024-NOISE, validating the effectiveness of our method.
Multimedia,Artificial Intelligence,Sound,Audio and Speech Processing
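
The multi-model voting step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the voting rule, so the agreement threshold, the function name `vote_pseudo_labels`, and the sample identifiers below are all assumptions made for illustration, not the authors' implementation. The idea shown is simply that an unlabeled sample is kept as a pseudo-labeled training example only when enough models agree on its emotion label.

```python
from collections import Counter


def vote_pseudo_labels(predictions, min_agreement=1.0):
    """Select pseudo-labels by voting across several models.

    predictions: dict mapping a sample id to a list of predicted labels,
                 one label per model (hypothetical input format).
    min_agreement: fraction of models that must agree on the top label
                   for the sample to be kept (assumed rule; 1.0 = unanimous).
    Returns a dict mapping each kept sample id to its voted label.
    """
    selected = {}
    for sample_id, labels in predictions.items():
        # Most frequent label and how many models predicted it.
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            selected[sample_id] = label
    return selected


# Hypothetical predictions from three models on three unlabeled clips.
preds = {
    "clip_001": ["happy", "happy", "happy"],
    "clip_002": ["sad", "angry", "sad"],
    "clip_003": ["neutral", "worried", "angry"],
}

unanimous = vote_pseudo_labels(preds)                     # strict rule
majority = vote_pseudo_labels(preds, min_agreement=2 / 3)  # looser rule
print(unanimous)
print(majority)
```

Under this assumed scheme, running several rounds would mean retraining the models on the enlarged labeled set and voting again on the samples that remain unlabeled. A per-class threshold is one possible way to favor under-represented emotions, though the abstract does not say how class imbalance is handled in the voting.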