Hierarchical multi‐modal video summarization with dynamic sampling

Lingjian Yu, Xing Zhao, Liang Xie, Haoran Liang, Ronghua Liang
DOI: https://doi.org/10.1049/ipr2.13269
IF: 2.3
2024-10-31
IET Image Processing
Abstract: Previous video summarization methods often neglect inter‐frame variations during the preprocessing stage. Sampling repeated frames leads to information redundancy, while missing key frames can skew semantic comprehension and make the generated summaries inaccurate. This work proposes a dynamic sampling module that leverages frame‐level motion information to alleviate these issues. The module samples at a higher frequency during intervals with significant changes, capturing finer detail. Combined with a hierarchical multi‐modal structure, it integrates shot‐level visual and textual information to enhance the semantic understanding of video clips and improve the accuracy of the summarized content. Extensive experiments on the benchmark datasets SumMe and TVSum demonstrate the effectiveness of the proposed method.
Keywords: computer science, artificial intelligence; engineering, electrical & electronic; imaging science & photographic technology
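
The abstract describes sampling more densely where inter‐frame change is large. The sketch below illustrates that general idea only; it is not the authors' module. It assumes a mean absolute pixel difference as the motion score and picks frames at equal steps of accumulated motion. The names `motion_scores`, `dynamic_sample`, and the `floor` parameter are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def motion_scores(frames):
    """Mean absolute pixel difference between consecutive frames.

    Assumption: each frame is a flat sequence of pixel intensities.
    This is a simple stand-in for a motion estimate, not the paper's measure.
    """
    return [
        sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        for prev, cur in zip(frames, frames[1:])
    ]


def dynamic_sample(frames, num_samples, floor=1e-3):
    """Pick frame indices separated by equal amounts of accumulated motion."""
    if not frames:
        return []
    if num_samples < 2 or len(frames) < 2:
        return [0]
    # A small floor keeps fully static stretches from contributing zero motion.
    scores = [s + floor for s in motion_scores(frames)]
    # cum[i] is the motion accumulated from frame 0 up to frame i.
    cum = [0.0] + list(accumulate(scores))
    step = cum[-1] / (num_samples - 1)
    picked = set()
    for k in range(num_samples):
        target = min(k * step, cum[-1])
        # First frame whose accumulated motion reaches the target.
        picked.add(min(bisect_left(cum, target), len(frames) - 1))
    return sorted(picked)


# Toy clip: 50 static frames followed by 50 frames that keep changing.
clip = [[0.0] * 4 for _ in range(50)] + [[10.0 * (i + 1)] * 4 for i in range(50)]
indices = dynamic_sample(clip, num_samples=10)
```

On this toy clip nearly all selected indices fall in the changing second half, while the static first half contributes only its first frame. Duplicate picks are merged, so fewer than `num_samples` indices may be returned.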