Abstract: Automatic summarization plays an important role in coping with the exponential growth of documents on the Web. On content websites such as CNN.com and WikiHow.com, the main document is often accompanied by various kinds of side information, such as videos, images, and queries, which attract attention and ease understanding. Such information can be used for better summarization, as it often explicitly or implicitly mentions the essence of the article. However, most existing side-aware summarization methods are designed to incorporate either single-modal or multi-modal side information, and cannot effectively adapt from one setting to the other. In this paper, we propose a general summarization framework that can flexibly incorporate various modalities of side information. The main challenges in designing a flexible summarization model with side information are: (1) the side information can be in textual or visual format, and the model needs to align and unify it with the document in the same semantic space; (2) the side inputs can contain information from various aspects, and the model should recognize the aspects useful for summarization. To address these two challenges, we first propose a unified topic encoder, which jointly discovers latent topics from the document and the various kinds of side information. The learned topics flexibly bridge and guide the information flow between multiple inputs in a graph encoder through a topic-aware interaction. Second, we propose a triplet contrastive learning mechanism to align the single-modal or multi-modal information into a unified semantic space, where summary quality is enhanced by a better understanding of the document and side information. Results show that our model significantly surpasses strong baselines on three public single-modal or multi-modal benchmark summarization datasets.
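To make the idea of aligning the document, the side information, and the summary in one semantic space more concrete, the following is a minimal sketch only. The abstract does not give the actual loss, so everything here is an assumption: the function names (`info_nce`, `triplet_contrastive_loss`), the pairwise InfoNCE-style formulation, the cosine similarity, and the temperature value are illustrative choices, not the authors' method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE-style loss over a batch.

    anchors[i] is expected to match positives[i]; the other entries of
    `positives` in the batch act as negatives (an assumed setup).
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def triplet_contrastive_loss(doc, side, summ, temperature=0.1):
    """Hypothetical three-way alignment: average the contrastive loss over
    the (document, side), (document, summary), and (side, summary) pairs."""
    return (
        info_nce(doc, side, temperature)
        + info_nce(doc, summ, temperature)
        + info_nce(side, summ, temperature)
    ) / 3.0


# Toy embeddings for a batch of two examples (made-up numbers).
doc = [[1.0, 0.0], [0.0, 1.0]]
side = [[0.9, 0.1], [0.1, 0.9]]
summ = [[1.0, 0.1], [0.1, 1.0]]

aligned = triplet_contrastive_loss(doc, side, summ)
# Swapping the side inputs between examples breaks the alignment.
shuffled = triplet_contrastive_loss(doc, side[::-1], summ)
```

In this toy setup the loss is lower when each document is paired with its own side information than when the side inputs are swapped between examples, which is the behavior a contrastive alignment objective is meant to encourage.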

UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation

Learning User Interest with Improved Triplet Deep Ranking and Web-Image Priors for Topic-Related Video Summarization

An Unsupervised Video Summarization Method Based on Multimodal Representation

D$^2$TV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization

Learning Summary-Worthy Visual Representation for Abstractive Summarization in Video

MHMS: Multimodal Hierarchical Multimedia Summarization

Multimodal Cross-lingual Summarization for Videos: A Revisit in Knowledge Distillation Induced Triple-stage Training Method

Video summarization via knowledge-aware multimodal deep networks

UniSumm: Unified Few-shot Summarization with Multi-Task Pre-Training and Prefix-Tuning

Align vision-language semantics by multi-task learning for multi-modal summarization

UniSumm and SummZoo: Unified Model and Diverse Benchmark for Few-Shot Summarization

UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos

Multi-task Hierarchical Heterogeneous Fusion Framework for multimodal summarization

CISum: Learning Cross-modality Interaction to Enhance Multimodal Semantic Coverage for Multimodal Summarization

Multization: Multi-Modal Summarization Enhanced by Multi-Contextually Relevant and Irrelevant Attention Alignment

Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization

A Topic-aware Summarization Framework with Different Modal Side Information

Multi-modal Summarization for Video-containing Documents

CFSum: A Coarse-to-Fine Contribution Network for Multimodal Summarization

Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment

Personalized Video Summarization by Multimodal Video Understanding