Abstract: Video summarization is a critical task in video analysis that aims to create a brief yet informative summary of the original video (i.e., a set of keyframes) while retaining its primary content. Supervised summarization methods rely on time-consuming keyframe labeling and thus often suffer from insufficient training data. In contrast, unsupervised summarization methods often perform unsatisfactorily because they lack semantically meaningful guidance for keyframe selection. In this study, we propose a novel self-supervised video summarization framework based on computational optimal transport techniques. Specifically, we generate textual descriptions from video shots and, by solving an inverse optimal transport problem, jointly learn a projection from the textual embeddings to the visual ones and an optimal transport plan between them. We propose an alternating optimization algorithm to solve this problem efficiently and design an effective mechanism in the algorithm to avoid trivial solutions. Given the optimal transport plan and the underlying distance between the projected textual embeddings and the visual ones, we synthesize pseudo-significance scores for video frames and use these scores as offline supervision to train a keyframe selector. Without subjective and error-prone manual annotations, the proposed framework surpasses previous unsupervised methods in producing high-quality results for generic and instructional video summarization tasks, and its performance is even comparable to that of supervised competitors. The code is available at https://github.com/Dixin-s-Lab/Video-Summary-IOT.
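
To make the last step of the abstract concrete, the following is a minimal sketch, not the authors' implementation, of how an optimal transport plan and a distance matrix could be turned into per-frame pseudo-significance scores. It assumes the textual embeddings have already been projected into the visual space (the inverse optimal transport step that learns this projection is not shown), uses standard entropic Sinkhorn iterations with uniform marginals, and scores each frame by its transport-weighted cosine similarity. The function names, the scoring rule, the regularization weight `reg`, and the random toy data are all illustrative assumptions.

```python
import math
import random


def cosine_cost(text_vecs, frame_vecs):
    """Cosine distance matrix: rows are text embeddings, columns are frames."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    t = [unit(v) for v in text_vecs]
    f = [unit(v) for v in frame_vecs]
    return [[1.0 - sum(a * b for a, b in zip(ti, fj)) for fj in f] for ti in t]


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropic OT plan between uniform marginals via Sinkhorn iterations."""
    n, m = len(cost), len(cost[0])
    a, b = 1.0 / n, 1.0 / m  # uniform mass on descriptions and on frames
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        u = [a / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def pseudo_scores(plan, cost):
    """Hypothetical scoring rule: transport-weighted similarity, min-max scaled."""
    m = len(plan[0])
    raw = [
        sum(p_row[j] * (1.0 - c_row[j]) for p_row, c_row in zip(plan, cost))
        for j in range(m)
    ]
    lo, hi = min(raw), max(raw)
    return [(r - lo) / (hi - lo + 1e-12) for r in raw]


# Toy data: 3 already-projected text embeddings and 6 frame embeddings (assumed).
random.seed(0)
projected_text = [[random.gauss(0, 1) for _ in range(8)] for _ in range(3)]
frames = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]

cost = cosine_cost(projected_text, frames)
plan = sinkhorn_plan(cost)
scores = pseudo_scores(plan, cost)
```

In this sketch, frames that receive mass from semantically close descriptions get higher scores, and such scores could then serve as regression targets for a keyframe selector. The exact scoring rule and the mechanism for avoiding trivial solutions used in the paper are not specified in the abstract and are not reproduced here.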

SCCS: Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment

An Unsupervised Video Summarization Method Based on Multimodal Representation

MHMS: Multimodal Hierarchical Multimedia Summarization

MSMO: Multimodal Summarization with Multimodal Output

Multimodal Summarization with Guidance of Multimodal Reference

CISum: Learning Cross-modality Interaction to Enhance Multimodal Semantic Coverage for Multimodal Summarization

Abstractive Sentence Summarization with Guidance of Selective Multimodal Reference

CMFF_VS: A Video Summarization Extraction Model Based on Cross-modal Feature Fusion

TLDW: Extreme Multimodal Summarisation of News Videos

VMSMO: Learning to Generate Multimodal Summary for Video-based News Articles

VideoXum: Cross-modal Visual and Textural Summarization of Videos

CTNR: Compress-then-Reconstruct Approach for Multimodal Abstractive Summarization

UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos

SITransformer: Shared Information-Guided Transformer for Extreme Multimodal Summarization

Realizing Video Summarization from the Path of Language-based Semantic Understanding

Hierarchical Cross-Modality Semantic Correlation Learning Model for Multimodal Summarization

Self-supervised Video Summarization Guided by Semantic Inverse Optimal Transport

SimCSum: Joint Learning of Simplification and Cross-lingual Summarization for Cross-lingual Science Journalism

Query-Oriented Micro-Video Summarization

Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization