Action Parsing-Driven Video Summarization Based on Reinforcement Learning
Jie Lei, Qiao Luan, Xinhui Song, Xiao Liu, Dapeng Tao, Mingli Song
DOI: https://doi.org/10.1109/tcsvt.2018.2860797
IF: 5.859
2018-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Managing, storing, and indexing large numbers of videos is an urgent problem. Although many video summarization models achieve good results, models based on low-level features cannot capture important semantic information, and models based on semantic analysis require related text descriptions that do not exist for most videos. Consequently, mining the semantic information contained in the video itself is a more feasible approach. In this paper, we propose an action parsing-driven video summarization model based on reinforcement learning. The model has two main parts: video cutting by action parsing, and video summarization based on reinforcement learning. In the first part, a sequential multiple instance learning model is trained with weakly annotated data, which avoids the time cost of full annotation while addressing the ambiguity of weak annotation. In the second part, we design a deep recurrent neural network-based video summarization model that selects the frames that best distinguish an action from other actions. The quality of the extracted key frames can then be evaluated by the categorization accuracy. Experiments and comparisons with state-of-the-art methods demonstrate the advantage of the proposed approach.
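The second part of the abstract (selecting key frames so that a classifier can still recognize the action, with categorization accuracy as the quality signal) can be illustrated with a toy sketch. This is not the authors' model: the deep recurrent network is swapped for independent per-frame selection logits trained with a plain REINFORCE update, the action classifier is swapped for a nearest-centroid rule, and the sparsity penalty, the feature values, and all names (`classify`, `reward`, `train_selector`, `FEATURES`, `CENTROIDS`) are assumptions made for illustration only.

```python
import math
import random

# Hypothetical toy data: one "wave" segment with four 2-D frame features.
# Frames 0 and 1 resemble "clap"; frames 2 and 3 are distinctive for "wave".
CENTROIDS = {"wave": [1.0, 0.0], "clap": [0.0, 1.0]}
FEATURES = [[0.1, 0.9], [0.2, 0.8], [1.0, 0.0], [0.9, 0.1]]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classify(frames, centroids):
    """Nearest-centroid label for the mean of the given frame features."""
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    dists = {
        label: sum((m - c) ** 2 for m, c in zip(mean, cen))
        for label, cen in centroids.items()
    }
    return min(dists, key=dists.get)


def reward(selected, features, label, centroids, sparsity=0.05):
    """1 if the selected frames are classified as the true action, else 0,
    minus an assumed per-frame penalty that favors short summaries."""
    if not selected:
        return 0.0
    pred = classify([features[i] for i in selected], centroids)
    return (1.0 if pred == label else 0.0) - sparsity * len(selected)


def train_selector(features, label, centroids, steps=200, lr=0.5, seed=0):
    """REINFORCE over independent per-frame Bernoulli selection logits."""
    rng = random.Random(seed)
    logits = [0.0] * len(features)
    baseline = 0.0  # moving-average reward, used to reduce variance
    for _ in range(steps):
        probs = [sigmoid(z) for z in logits]
        actions = [1 if rng.random() < p else 0 for p in probs]
        selected = [i for i, a in enumerate(actions) if a]
        r = reward(selected, features, label, centroids)
        advantage = r - baseline
        baseline = 0.9 * baseline + 0.1 * r
        # Gradient of log Bernoulli(a; sigmoid(z)) with respect to z is a - p.
        for i, (a, p) in enumerate(zip(actions, probs)):
            logits[i] += lr * advantage * (a - p)
    return [sigmoid(z) for z in logits]
```

Calling `train_selector(FEATURES, "wave", CENTROIDS)` returns one selection probability per frame. In this toy setup the reward is higher when a distinctive frame is kept and lower when ambiguous frames are added, which mirrors the idea of using categorization accuracy to judge the extracted key frames.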