HGTMFS: A Hypergraph Transformer Framework for Multimodal Summarization

Ming Lu, Xinxi Lu, Xiaoming Zhang
DOI: https://doi.org/10.3390/app14209563
2024-01-01
Abstract: Multimodal summarization, a rapidly evolving field within multimodal learning, focuses on generating cohesive summaries by integrating information from diverse modalities, such as text and images. Unlike traditional unimodal summarization, multimodal summarization presents unique challenges, particularly in capturing fine-grained interactions between modalities. Current models often fail to account for complex cross-modal interactions, leading to suboptimal performance and an over-reliance on one modality. To address these issues, we propose a novel framework, hypergraph transformer-based multimodal summarization (HGTMFS), designed to model high-order relationships across modalities. HGTMFS constructs a hypergraph that incorporates both textual and visual nodes and leverages transformer mechanisms to propagate information within the hypergraph. This approach enables the efficient exchange of multimodal data and improves the integration of fine-grained semantic relationships. Experimental results on several benchmark datasets demonstrate that HGTMFS outperforms state-of-the-art methods in multimodal summarization.
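To make the idea of propagating information over a hypergraph of textual and visual nodes concrete, the following is a minimal sketch and not the architecture of HGTMFS, whose details the abstract does not give. It assumes a two-stage attention scheme (nodes to hyperedges, then hyperedges back to nodes) with plain scaled dot-product attention and no learned projections. The function names, the toy features, and the hyperedge layout are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def attend(query, keys):
    """Scaled dot-product attention; keys double as values (no learned weights)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(len(query))]

def propagate(node_feats, hyperedges):
    """One round of node -> hyperedge -> node message passing (assumed scheme)."""
    # Stage 1: each hyperedge summarizes its member nodes,
    # using the mean of the members as the attention query.
    edge_feats = []
    for members in hyperedges:
        member_feats = [node_feats[i] for i in members]
        edge_feats.append(attend(mean(member_feats), member_feats))
    # Stage 2: each node attends over the hyperedges it belongs to.
    updated = []
    for i, feat in enumerate(node_feats):
        incident = [edge_feats[e] for e, members in enumerate(hyperedges) if i in members]
        # A node that belongs to no hyperedge keeps its original feature.
        updated.append(attend(feat, incident) if incident else list(feat))
    return updated

# Hypothetical toy input: nodes 0-2 stand for text units, nodes 3-4 for image regions.
node_feats = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
# Each hyperedge links an arbitrary number of nodes, possibly across modalities.
hyperedges = [(0, 1, 3), (2, 4), (0, 2, 3, 4)]
updated = propagate(node_feats, hyperedges)
```

Because a hyperedge can connect any number of nodes at once, a single edge such as `(0, 2, 3, 4)` relates several text units and image regions jointly, which is the kind of high-order relationship that pairwise graph edges cannot express directly.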