Spatiotemporal Two-Stream LSTM Network for Unsupervised Video Summarization

Min Hu, Ruimin Hu, Zhongyuan Wang, Zixiang Xiong, Rui Zhong
DOI: https://doi.org/10.1007/s11042-022-12901-4
IF: 2.577
2022-01-01
Multimedia Tools and Applications
Abstract: In user-created videos, content that changes constantly between neighboring frames poses a challenge for prior video summarization methods. If the critical features of the frames are refined, keyframe selection, which is central to video summarization, can reach promising accuracy. In this work, we propose a spatiotemporal two-stream LSTM network-based (ST-LSTM) model that enhances these critical features by combining spatial saliency with temporal semantic dependencies; we refer to this combination as the two-stream method. Motivated by the fact that large and moving objects attract more visual attention, we design a saliency-area-based attention network to filter out irrelevant information that does not attract attention. We use a recent attention-based Bi-LSTM network to extract temporal dependencies among the semantic features. Furthermore, a multi-feature-based reward function is presented to reinforce the ST-LSTM model by integrating diversity, representativeness, and storyness. Finally, the Deep Deterministic Policy Gradient (DDPG) algorithm is adopted to train the proposed method without supervision. Extensive experiments on public datasets demonstrate that our method outperforms state-of-the-art methods.
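The abstract names a reward built from diversity, representativeness, and storyness but does not give the formulas. As a rough illustration only, the sketch below computes two of these terms using a formulation that is common in reinforcement-learning-based summarization: diversity as the mean pairwise cosine dissimilarity of the selected frames, and representativeness as the exponential of the negative mean squared distance from each frame to its nearest selected frame. The function names, the equal weighting of the two terms, and the toy feature vectors are assumptions, not the paper's definitions; storyness is omitted because the abstract does not define it.

```python
import math

# Illustrative sketch only: these formulas are an assumed, commonly used
# formulation, not the reward defined in the paper.

def cosine_dissimilarity(a, b):
    """Return 1 - cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: treat a zero vector as maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

def diversity_reward(features, selected):
    """Mean pairwise dissimilarity among the selected frame indices."""
    k = len(selected)
    if k < 2:
        return 0.0
    total = sum(
        cosine_dissimilarity(features[i], features[j])
        for i in selected
        for j in selected
        if i != j
    )
    return total / (k * (k - 1))

def representativeness_reward(features, selected):
    """exp(-mean squared distance from each frame to its nearest selected frame)."""
    if not selected:
        return 0.0
    total = 0.0
    for frame in features:
        total += min(
            sum((x - y) ** 2 for x, y in zip(frame, features[j]))
            for j in selected
        )
    return math.exp(-total / len(features))

def combined_reward(features, selected):
    """Assumed equal-weight sum of the two terms."""
    return diversity_reward(features, selected) + representativeness_reward(features, selected)

# Toy example: three 2-D frame features; frames 0 and 2 are identical.
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
good_summary = [0, 1]       # covers both distinct frames
redundant_summary = [0, 2]  # picks the same content twice
```

On the toy features, the summary that covers both distinct frames scores higher on both terms than the one that picks duplicated content, which is the behavior such a reward is meant to encourage.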