Integrating Stickers into Multimodal Dialogue Summarization: A Novel Dataset and Approach for Enhancing Social Media Interaction

Yuanchen Shi, Fang Kong
DOI: https://doi.org/10.1145/3664647.3680978
2024-01-01
Abstract: With the popularity of social media, a growing number of online chats and comments take the form of multimodal dialogues containing stickers. Automatically summarizing these dialogues can effectively reduce content overload and save reading time. However, existing datasets and work address either text-only dialogue summarization or articles with real photos, for which text summarization and key image extraction are performed separately; none simultaneously considers automatic summarization of multimodal dialogues with sticker images in online chat scenarios. To compensate for the lack of datasets and research in this field, we propose a brand-new Multimodal Chat Dialogue Summarization Containing Stickers (MCDSCS) task and dataset. The dataset consists of 5,527 Chinese multimodal chat dialogues and 14,356 different sticker images, with stickers interspersed in the text of each dialogue to reflect real social media chat scenarios. MCDSCS also helps fill the gap in Chinese multimodal dialogue data. We use the most advanced GPT-4 model with carefully designed Chain-of-Thought (CoT) prompting, supplemented with manual review, to generate dialogues and extract summaries. We also propose a novel method that integrates the visual information of stickers with text descriptions of emotions and intentions (TEI). Experiments show that our method effectively improves the performance of various mainstream summary generation models, even surpassing some other multimodal models, ChatGPT, and Vision Large Language Models (VLMs). Our data and code are publicly available at https://github.com/FakerBoom/MCDSCS.