MMSF: A Multimodal Sentiment-Fused Method to Recognize Video Speaking Style.

Fan Yu, Beibei Zhang, Yaqun Fang, Jia Bei, Tongwei Ren, Jiyi Li, Luca Rossetto
DOI: https://doi.org/10.1145/3591106.3592219
2024-01-01
Abstract: As talking takes up a large proportion of human lives, a deeper understanding of human conversations is necessary. Speaking style recognition aims to recognize the styles of conversations, providing a fine-grained description of talking. Current works rely only on visual clues to recognize speaking styles and therefore cannot accurately distinguish speaking styles that are visually similar. To recognize speaking styles more effectively, we propose a novel multimodal sentiment-fused method, MMSF, which extracts and integrates visual, audio, and textual features of videos. In addition, as sentiment is one of the motivations of human behavior, for the first time we introduce sentiment into our multimodal method through a cross-attention mechanism, which enhances the video features used to recognize speaking styles. The proposed MMSF is evaluated on a long-form video understanding benchmark, and the experimental results show that it outperforms state-of-the-art methods.
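The abstract does not specify how the cross-attention fusion is wired, so the following is only a minimal sketch of generic scaled dot-product cross-attention, not the authors' architecture. It assumes video features act as queries and sentiment features as keys and values, omits learned projection matrices, and combines the result with a residual addition; the function names (`cross_attention`, `fuse`) and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    Each query row attends over all key rows; the output row is the
    attention-weighted sum of the value rows. No learned projections
    are applied (an assumption made to keep the sketch short).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def fuse(video_feats, sentiment_feats):
    """Enhance video features with attended sentiment features.

    Assumption: video features are the queries, sentiment features are
    the keys and values, and the fusion is a residual addition.
    """
    attended = cross_attention(video_feats, sentiment_feats, sentiment_feats)
    return [
        [v + a for v, a in zip(v_row, a_row)]
        for v_row, a_row in zip(video_feats, attended)
    ]


# Toy example: three video feature vectors, two sentiment feature vectors.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sentiment = [[0.5, 0.5], [1.0, -1.0]]
fused = fuse(video, sentiment)
```

The fused output keeps one row per video feature vector, so it can replace the original video features in a downstream classifier. In a trained model the queries, keys, and values would normally pass through learned linear projections first.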