Multimodal Dialogue Understanding via Holistic Modeling and Sequence Labeling.

Chenran Cai, Qin Zhao, Ruifeng Xu, Bing Qin
DOI: https://doi.org/10.1007/978-3-031-44699-3_36
2023-01-01
Abstract: This paper presents the experimental schemes of Team HLT-base for the NLPCC-2023-Shared-Task-10 Learn to Watch TV: Multimodal Dialogue Understanding and Response Prediction (MDUG) competition. We focus on two subtasks of multimodal dialogue understanding: dialogue scene identification and dialogue session identification. To solve these subtasks, we propose a simple and efficient multimodal framework that addresses two points: modeling the interaction between utterances and effectively fusing information from different modalities. For the former, we concatenate all utterances into a single sentence and feed it into a pre-trained model; for the latter, we use a transformer layer to fuse the multimodal features. Extensive experiments show that the proposed framework achieves state-of-the-art (SOTA) performance compared with other competitive methods and ranks 1st in both subtasks (track 1: dialogue scene identification; track 2: dialogue session identification) of the MDUG competition.
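The holistic input construction and sequence-labeling view described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `[CLS]`/`[SEP]` tokens, the whitespace tokenization, the helper names `build_holistic_input` and `labels_to_segments`, and the convention that label 1 marks the last utterance of a scene or session are all assumptions made for illustration. The multimodal fusion layer is not covered.

```python
from typing import List, Tuple

# Hypothetical special tokens; the abstract does not state which ones are used.
CLS = "[CLS]"
SEP = "[SEP]"


def build_holistic_input(utterances: List[str]) -> Tuple[List[str], List[int]]:
    """Concatenate all utterances into one token sequence.

    Returns the tokens and, for each utterance, the index of the separator
    that follows it. A sequence labeler could read one prediction per
    utterance from the encoder output at these positions.
    """
    tokens = [CLS]
    boundary_positions = []
    for utterance in utterances:
        # Whitespace splitting stands in for a real subword tokenizer.
        tokens.extend(utterance.split())
        boundary_positions.append(len(tokens))  # index the separator will occupy
        tokens.append(SEP)
    return tokens, boundary_positions


def labels_to_segments(labels: List[int]) -> List[List[int]]:
    """Group utterance indices into segments.

    Assumes label 1 marks the final utterance of a scene or session.
    """
    segments: List[List[int]] = []
    current: List[int] = []
    for index, label in enumerate(labels):
        current.append(index)
        if label == 1:
            segments.append(current)
            current = []
    if current:  # trailing utterances with no closing boundary
        segments.append(current)
    return segments
```

Because every utterance sits in the same input sequence, a pre-trained encoder can attend across the whole dialogue before each per-utterance label is predicted, which is one plausible reading of the "interaction of different utterances" the abstract mentions.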