Transformer-based Video Summarization with Spatial-Temporal Representation

Suru Feng, Yuxiang Xie, Yingmei Wei, Jie Yan, Qi Wang
DOI: https://doi.org/10.1109/bigdia56350.2022.9874248
2022-01-01
Abstract: Video summarization is an actively studied research topic. With the rise of deep learning, CNNs and RNNs have been applied to generating video summaries. However, because a video contains many frames and spans a long time, its spatial-temporal structure is complex; abstracting this structure is nonetheless necessary for generating a summary and has been a recent research focus. Building on prior work, we propose a new method for video summary generation that consists of three deep neural network models. First, a 2D CNN processes the video frames and converts a short video into a vector representation that can be computed on flexibly. Next, a 1D convolution performs sequence analysis on the temporal information in the data. A Transformer encoder, a model currently used in natural language processing, then further extracts temporal information. Finally, up-sampling makes the output dimension equal to the number of frames in the input short video. Through training, the model learns importance scores that indicate the importance of each video frame, and key shots are then selected to form the video summary. Experimental results show that our model performs better than existing methods on two generic datasets.
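The last two steps of the pipeline (up-sampling the scores to the frame count, then selecting key shots) can be illustrated with a minimal sketch. The abstract does not say how up-sampling or shot selection is performed, so everything here is an assumption made for illustration: nearest-neighbour up-sampling, a 0/1 knapsack over shots under a length budget, and the hypothetical names `upsample_nearest`, `select_key_shots`, and `budget_ratio`.

```python
def upsample_nearest(scores, num_frames):
    """Stretch a short score sequence to num_frames entries.

    Assumption: nearest-neighbour repetition; the paper's up-sampling
    method is not specified in the abstract.
    """
    n = len(scores)
    return [scores[min(i * n // num_frames, n - 1)] for i in range(num_frames)]


def select_key_shots(frame_scores, shots, budget_ratio=0.15):
    """Choose shots maximizing summed mean score under a length budget.

    Assumption: 0/1 knapsack selection with a hypothetical budget_ratio;
    `shots` is a list of (start, end) frame indices, end exclusive.
    """
    budget = int(len(frame_scores) * budget_ratio)
    values = [sum(frame_scores[s:e]) / (e - s) for s, e in shots]
    lengths = [e - s for s, e in shots]
    # best[c] holds (best total value, chosen shot indices) for capacity c.
    best = [(0.0, [])] * (budget + 1)
    for i, (value, length) in enumerate(zip(values, lengths)):
        # Iterate capacities downward so each shot is used at most once.
        for c in range(budget, length - 1, -1):
            candidate = best[c - length][0] + value
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - length][1] + [i])
    return sorted(best[budget][1])


# Toy input: 4 coarse scores stretched to 8 frames, 4 shots of 2 frames each.
coarse_scores = [0.1, 0.9, 0.2, 0.8]
frame_scores = upsample_nearest(coarse_scores, 8)
shots = [(0, 2), (2, 4), (4, 6), (6, 8)]
summary_shots = select_key_shots(frame_scores, shots, budget_ratio=0.5)
```

With a budget of half the video, the toy example keeps the two highest-scoring shots (indices 1 and 3). A real system would take `frame_scores` from the trained model rather than from hand-written values.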