Heterogeneous graphormer for extractive multimodal summarization

Xiankai Jiang, Jingqiang Chen
DOI: https://doi.org/10.1007/s10844-024-00886-5
2024-10-01
Journal of Intelligent Information Systems
Abstract: Multimodal summarization with multimodal output (MSMO) aims to generate summaries that incorporate both text and images. Existing methods have not effectively leveraged intermodal relationships, such as sentence-image relationships, which are crucial for generating high-quality multimodal summaries. In this paper, we propose a heterogeneous graph-based model for multimodal summarization (HGMS) designed to efficiently leverage intermodal relationships within multimodal data. The model constructs a heterogeneous graph based on the relationships between modalities, containing nodes for words, sentences and images. An enhanced Graphormer is then proposed to update node representations, aiming to more effectively model intricate relationships between multiple modalities. To the best of our knowledge, we are the first to apply Graphormer in the field of graph-based summarization. Experimental results on a large-scale benchmark dataset demonstrate that HGMS achieves state-of-the-art performance in terms of automatic metrics and human evaluations.
computer science, information systems, artificial intelligence
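
The abstract describes a heterogeneous graph with nodes for words, sentences and images. As a rough illustration of that data structure only, the sketch below builds such a graph in plain Python. Everything specific here is an assumption and not taken from the paper: the function name `build_heterogeneous_graph`, whitespace tokenization, word-sentence edges by containment, and the premise that sentence-image links are supplied as precomputed pairs. It does not implement the Graphormer update step.

```python
from collections import defaultdict


def build_heterogeneous_graph(sentences, image_ids, sentence_image_links):
    """Build a toy graph with word, sentence, and image nodes.

    Hypothetical helper, not the paper's construction.
    sentence_image_links is an iterable of (sentence_index, image_id)
    pairs assumed to be computed elsewhere.
    """
    nodes = {"word": set(), "sentence": set(), "image": set(image_ids)}
    # Edges are grouped by (source_type, target_type).
    edges = defaultdict(set)

    for index, sentence in enumerate(sentences):
        nodes["sentence"].add(index)
        # Assumption: naive lowercase whitespace tokenization.
        for word in sentence.lower().split():
            nodes["word"].add(word)
            edges[("word", "sentence")].add((word, index))

    for index, image_id in sentence_image_links:
        # Keep only links whose endpoints are known nodes.
        if index in nodes["sentence"] and image_id in nodes["image"]:
            edges[("sentence", "image")].add((index, image_id))

    return nodes, dict(edges)
```

Grouping edges by type pair keeps the intermodal (sentence-image) relationships separate from the textual (word-sentence) ones, which is the distinction the abstract emphasizes.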