Abstract: The video highlight detection task is to localize key elements (moments of major or special interest to the user) in a video. Most existing highlight detection approaches extract features from the video segment as a whole, without considering how local features differ temporally and spatially. Because video content is complex, such mixed features can degrade the final highlight prediction. Temporally, not all frames are worth watching, because some contain only the background of the environment without humans or other moving objects. Spatially, likewise, not all regions in each frame are highlights, especially when the background is heavily cluttered. To solve this problem, we propose a novel three-dimensional (3-D) (spatial+temporal) attention model that can automatically localize the key elements in a video without any extra supervised annotations. Specifically, the proposed attention model produces attention weights for local regions along both the spatial and temporal dimensions of the video segment. Regions containing key elements are strengthened with large weights, which yields a more effective feature of the video segment for predicting the highlight score. The proposed 3-D attention scheme can be easily integrated into a conventional end-to-end deep ranking model, which learns a deep neural network to compute the highlight score of each video segment. Extensive experimental results on the YouTube and SumMe datasets demonstrate that the proposed approach achieves significant improvement over state-of-the-art methods. With the proposed 3-D attention model, video highlights can be accurately retrieved in the spatial and temporal dimensions without human supervision in several domains on the public datasets, such as gymnastics, parkour, skating, skiing, surfing, and dog activities.
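The idea of weighting local regions across space and time, and then ranking segments by score, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the linear scorers `w_att` and `w_rank`, the single softmax applied jointly over all (time, row, column) positions, and the hinge-style pairwise loss with margin 1.0. The actual model is a deep network and may factor or train its attention differently.

```python
import math

def flatten_regions(segment):
    """Flatten segment[t][h][w] (each entry a feature vector) into a list of vectors."""
    return [vec for frame in segment for row in frame for vec in row]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(segment, w_att):
    """Weight every local region by a softmax over all (t, h, w) positions.

    w_att is a hypothetical linear scorer standing in for a learned
    attention sub-network.
    """
    regions = flatten_regions(segment)
    weights = softmax([dot(w_att, r) for r in regions])
    dim = len(regions[0])
    pooled = [sum(a * r[d] for a, r in zip(weights, regions)) for d in range(dim)]
    return pooled, weights

def highlight_score(segment, w_att, w_rank):
    """Score a segment from its attention-pooled feature (linear scorer, assumed)."""
    pooled, _ = attention_pool(segment, w_att)
    return dot(w_rank, pooled)

def pairwise_ranking_loss(score_highlight, score_non_highlight, margin=1.0):
    """Hinge loss that is zero once the highlight outscores the non-highlight by the margin."""
    return max(0.0, margin - score_highlight + score_non_highlight)

# Toy segment: T=2 frames, H=1 row, W=2 columns, 2-D features per region.
segment = [
    [[[1.0, 0.0], [0.0, 1.0]]],
    [[[1.0, 0.0], [0.0, 1.0]]],
]
pooled, weights = attention_pool(segment, w_att=[0.0, 0.0])
```

With a zero attention scorer, all four regions receive equal weight, so the pooled feature is the plain average of the region features. A non-zero `w_att` shifts weight toward the regions it scores highly, which is the effect the abstract describes as strengthening key elements.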

Video Highlight Detection Via Region-Based Deep Ranking Model

Video Highlight Detection Via Deep Ranking Modeling

Three-Dimensional Attention-Based Deep Ranking Model for Video Highlight Detection

Learning User Interest with Improved Triplet Deep Ranking and Web-Image Priors for Topic-Related Video Summarization

Highlight Detection With Pairwise Deep Ranking For First-Person Video Summarization

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

A Deep Ranking Model for Spatio-Temporal Highlight Detection from a 360 Video

Learning Pixel-Level Distinctions for Video Highlight Detection

Emotion Knowledge Driven Video Highlight Detection

Temporal Cue Guided Video Highlight Detection with Low-Rank Audio-Visual Fusion

MINI-Net: Multiple Instance Ranking Network for Video Highlight Detection

HighlightMe: Detecting Highlights from Human-Centric Videos

Local Attention Sequence Model for Video Object Detection

Unsupervised Modality-Transferable Video Highlight Detection With Representation Activation Sequence Learning

A Semi-Automatic Feature Selecting Method For Sports Video Highlight Annotation

Indirect Match Highlights Detection with Deep Convolutional Neural Networks

Highlight Ranking for Racquet Sports Video in User Attention Subspaces Based on Relevance Feedback

Unsupervised Video Highlight Detection by Learning from Audio and Visual Recurrence

Video Highlight Prediction Using Audience Chat Reactions

Video Highlights Detection and Summarization with Lag-Calibration based on Concept-Emotion Mapping of Crowd-sourced Time-Sync Comments