Going Beyond Closed Sets: A Multimodal Perspective for Video Emotion Analysis.

Hao Pu, Yuchong Sun, Ruihua Song, Xu Chen, Hao Jiang, Yi Liu, Zhao Cao
DOI: https://doi.org/10.1007/978-981-99-8537-1_19
2024-01-01
Abstract: Emotion analysis plays a crucial role in understanding video content. Existing studies often approach it as a closed-set classification task, which overlooks the fact that human emotional experiences are too complex to be adequately expressed by a limited number of categories. In this paper, we propose MM-VEMA, a novel MultiModal perspective for Video EMotion Analysis. We formulate the task as a crossmodal matching problem within a joint multimodal space of videos and emotional experiences (e.g., emotional words, phrases, sentences). By finding the experiences that most closely match each video in this space, we can derive the emotions evoked by the video in a more comprehensive manner. To construct this joint multimodal space, we introduce an efficient yet effective method that manipulates the multimodal space of a pre-trained vision-language model using a small set of emotional prompts. We conduct experiments and analyses to demonstrate the effectiveness of our methods. The results show that videos and emotional experiences are well aligned in the joint multimodal space. Our model also achieves state-of-the-art performance on three public datasets.
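The crossmodal matching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a video and a set of candidate emotional experiences have already been embedded into a shared space by some encoder, and it simply ranks the candidates by cosine similarity to the video. The function names (`cosine`, `rank_experiences`) and the three-dimensional toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_experiences(video_emb, text_embs, texts, top_k=3):
    """Return the top_k (text, similarity) pairs, most similar to the video first."""
    scored = [(text, cosine(video_emb, emb)) for text, emb in zip(texts, text_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for encoder outputs (assumed, not real model features).
texts = ["joy", "a quiet sense of nostalgia", "fear"]
text_embs = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
video_emb = [0.2, 0.9, 0.1]

ranked = rank_experiences(video_emb, text_embs, texts, top_k=2)
for text, score in ranked:
    print(f"{text}: {score:.3f}")
```

Because the candidates are free-form text rather than a fixed label set, the same ranking step works for words, phrases, or sentences, which is the sense in which such matching is not limited to a closed set of categories.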