Multimodal Emotion-Cause Pair Extraction in Conversations

Fanfan Wang, Zixiang Ding, Rui Xia, Zhaoyu Li, Jianfei Yu
DOI: https://doi.org/10.48550/arXiv.2110.08020
2021-10-15
Abstract: Emotion cause analysis has received considerable attention in recent years. Previous studies primarily focused on emotion cause extraction from texts in news articles or microblogs. It is also interesting to discover emotions and their causes in conversations. As conversation in its natural form is multimodal, a large number of studies have been carried out on multimodal emotion recognition in conversations, but there is still a lack of work on multimodal emotion cause analysis. In this work, we introduce a new task named Multimodal Emotion-Cause Pair Extraction in Conversations, aiming to jointly extract emotions and their associated causes from conversations reflected in multiple modalities (text, audio and video). We accordingly construct a multimodal conversational emotion cause dataset, Emotion-Cause-in-Friends, which contains 9,272 multimodal emotion-cause pairs annotated on 13,509 utterances in the sitcom Friends. We finally benchmark the task by establishing a baseline system that incorporates multimodal features for emotion-cause pair extraction. Preliminary experimental results demonstrate the potential of multimodal information fusion for discovering both emotions and causes in conversations.
Computation and Language
What problem does this paper attempt to address?
This paper addresses the joint extraction of emotions and their corresponding causes in conversations, specifically in a multimodal setting (text, audio, and video). The authors introduce a new task, Multimodal Emotion-Cause Pair Extraction in Conversations (MC-ECPE), which aims to extract emotions and their associated causes together from conversations expressed across multiple modalities. Although many studies have addressed text-based emotion cause analysis, and many others have addressed multimodal emotion recognition in conversations, work on multimodal emotion cause analysis remains scarce. The aim of the new task is to use multimodal information to discover emotions and their triggers in conversations more accurately. To support it, the authors construct a multimodal conversational emotion cause dataset, Emotion-Cause-in-Friends (ECF), which contains 9,272 multimodal emotion-cause pairs annotated on 13,509 utterances from the sitcom "Friends". They also establish a baseline system that incorporates multimodal features for emotion-cause pair extraction, and preliminary experiments demonstrate the potential of multimodal information fusion for discovering both emotions and causes in conversations.
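To make the task output concrete, the following is a minimal Python sketch of how an emotion-cause pair might be represented. It is an illustration only: the class names, field names, the helper `causes_of`, and the two-utterance conversation are all hypothetical and are not taken from the paper or from the ECF dataset. Audio and video features are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    """One conversational turn (hypothetical layout; audio/video omitted)."""
    index: int    # 1-based position in the conversation
    speaker: str
    text: str


@dataclass(frozen=True)
class EmotionCausePair:
    """Links an utterance that expresses an emotion to an utterance that causes it."""
    emotion_utt: int  # index of the utterance expressing the emotion
    emotion: str      # emotion category label
    cause_utt: int    # index of the utterance containing the cause


def causes_of(pairs, emotion_utt):
    """Return the sorted cause-utterance indices linked to one emotion utterance."""
    return sorted(p.cause_utt for p in pairs if p.emotion_utt == emotion_utt)


# Invented example conversation, not drawn from ECF.
conversation = [
    Utterance(1, "A", "I got the job!"),
    Utterance(2, "B", "That is wonderful news!"),
]

# Both speakers express joy, and both emotions are caused by utterance 1.
pairs = [
    EmotionCausePair(emotion_utt=1, emotion="joy", cause_utt=1),
    EmotionCausePair(emotion_utt=2, emotion="joy", cause_utt=1),
]

if __name__ == "__main__":
    for p in pairs:
        print(f"U{p.emotion_utt} ({p.emotion}) <- caused by U{p.cause_utt}")
```

In this sketch a system for the task would take `conversation` (plus the corresponding audio and video) as input and predict a set like `pairs`. An utterance can be its own cause, as with utterance 1 here.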