Abstract: The video highlight detection task is to localize key elements (moments of a user's major or special interest) in a video. Most existing highlight detection approaches extract features from the video segment as a whole, without considering how local features differ temporally and spatially. Because video content is complex, such mixed features can degrade the final highlight prediction. Temporally, not all frames are worth watching, because some contain only the background of the environment without humans or other moving objects. Spatially, likewise, not all regions in a frame are highlights, especially when the background is heavily cluttered. To address this problem, we propose a novel three-dimensional (3-D) (spatial + temporal) attention model that automatically localizes the key elements in a video without any extra supervised annotations. Specifically, the proposed attention model produces attention weights for local regions along both the spatial and temporal dimensions of the video segment. The regions containing key elements are strengthened with large weights, yielding a more effective feature of the video segment from which the highlight score is predicted. The proposed 3-D attention scheme can be easily integrated into a conventional end-to-end deep ranking model, which learns a deep neural network to compute the highlight score of each video segment. Extensive experimental results on the YouTube and SumMe datasets demonstrate that the proposed approach achieves significant improvement over state-of-the-art methods. With the proposed 3-D attention model, video highlights can be accurately retrieved in the spatial and temporal dimensions without human supervision in several domains on the public datasets, such as gymnastics, parkour, skating, skiing, surfing, and dog activities.
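
The attention-weighted pooling and ranking objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the single joint softmax over all spatio-temporal locations, the linear scoring vectors `w` and `v`, and the pairwise hinge loss with a margin are all assumptions chosen for brevity, and every function name here is hypothetical. Local features are assumed to be already extracted and flattened into a list of T*H*W vectors.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    features: Sequence[Sequence[float]], w: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Weight each local feature by attention and sum into one segment feature.

    features: N local feature vectors (N = T*H*W, flattened), each of dim D.
    w: hypothetical attention scoring vector of dim D (an assumption).
    Returns (attention weights, pooled segment feature).
    """
    scores = [sum(f_d * w_d for f_d, w_d in zip(f, w)) for f in features]
    weights = softmax(scores)  # one weight per spatio-temporal location
    dim = len(features[0])
    pooled = [sum(a * f[d] for a, f in zip(weights, features)) for d in range(dim)]
    return weights, pooled


def highlight_score(pooled: Sequence[float], v: Sequence[float]) -> float:
    """Linear highlight score; v is a hypothetical scoring vector."""
    return sum(p * v_d for p, v_d in zip(pooled, v))


def pairwise_ranking_loss(s_pos: float, s_neg: float, margin: float = 1.0) -> float:
    """Hinge loss asking a highlight segment to outscore a non-highlight one."""
    return max(0.0, margin - s_pos + s_neg)


# Toy segment with three local regions of dimension 2.
features = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
weights, pooled = attention_pool(features, w=[1.0, 0.0])
score = highlight_score(pooled, v=[1.0, 0.0])
```

In this toy segment the third region receives the largest attention weight, so it dominates the pooled feature. Only the ranking loss uses labels (which segment of a pair is the highlight), which matches the abstract's point that the attention weights need no extra annotations.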

Compact Bilinear Augmented Query Structured Attention for Sport Highlights Classification

A New Action Recognition Framework for Video Highlights Summarization in Sporting Events

Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection

Query As Supervision: Towards Low-Cost and Robust Video Moment and Highlight Retrieval

SportsCap: Monocular 3D Human Motion Capture and Fine-grained Understanding in Challenging Sports Videos

Context-aware Learning for Automatic Sports Highlight Recognition

Using Spatial‐Temporal Attention for Video Quality Evaluation

Sports Video Analysis on Large-Scale Data

Human Behavior Analysis for Highlight Ranking in Broadcast Racket Sports Video

Unifying Global and Local Scene Entities Modelling for Precise Action Spotting

Perceptual Visual Feature Learning With Applications in Sports Educational Image Understanding

A Semi-Automatic Feature Selecting Method For Sports Video Highlight Annotation

Three-Dimensional Attention-Based Deep Ranking Model for Video Highlight Detection

Long Video Scoring Method Fusing High-Precision Pose and Spatio-Temporal Attention Modules

Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports

Action recognition using attention-based spatio-temporal VLAD networks and adaptive video sequences optimization

Attention in Attention: Modeling Context Correlation for Efficient Video Classification

Video-Specific Query-Key Attention Modeling for Weakly-Supervised Temporal Action Localization

ACA-Net: adaptive context-aware network for basketball action recognition

Approximated Bilinear Modules for Temporal Modeling