Abstract: The video highlight detection task is to localize key elements (moments of major or special interest to the user) in a video. Most existing highlight detection approaches extract features from the video segment as a whole, without considering how local features differ temporally and spatially. Because video content is complex, such mixed features degrade the final highlight prediction. Temporally, not all frames are worth watching, because some contain only the background of the environment without humans or other moving objects. Spatially, not all regions in each frame are highlights either, especially when the background is heavily cluttered. To solve this problem, we propose a novel three-dimensional (3-D) (spatial + temporal) attention model that can automatically localize the key elements in a video without any extra supervised annotations. Specifically, the proposed attention model produces attention weights for local regions along both the spatial and temporal dimensions of the video segment. The regions containing key elements are strengthened with large weights, which yields a more effective feature of the video segment for predicting the highlight score. The proposed 3-D attention scheme can be easily integrated into a conventional end-to-end deep ranking model, which learns a deep neural network to compute the highlight score of each video segment. Extensive experimental results on the YouTube and SumMe datasets demonstrate that the proposed approach achieves significant improvement over state-of-the-art methods. With the proposed 3-D attention model, video highlights can be accurately retrieved in the spatial and temporal dimensions without human supervision in several domains on the public datasets, such as gymnastics, parkour, skating, skiing, surfing, and dog activities.
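
The idea of weighting local regions across space and time and then ranking segments can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single softmax taken jointly over all T x H x W locations, a linear scoring vector `w`, a linear highlight head `v`, and a hinge-style pairwise ranking loss, all of which are hypothetical choices made only for illustration.

```python
import math

def attention_pool(features, w):
    """Pool a segment's local features with spatio-temporal attention.

    features: nested list indexed [T][H][W][C] of local feature vectors.
    w: length-C scoring vector (hypothetical; a real model would learn it).
    Returns the pooled length-C feature and the flat list of attention weights.
    """
    # Flatten the T x H x W grid into a list of local feature vectors.
    locs = [vec for frame in features for row in frame for vec in row]
    # One scalar relevance score per spatio-temporal location.
    scores = [sum(a * b for a, b in zip(vec, w)) for vec in locs]
    # Softmax over all locations (assumption: joint over space and time).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of local features: high-weight regions dominate.
    pooled = [sum(wt * vec[c] for wt, vec in zip(weights, locs))
              for c in range(len(w))]
    return pooled, weights

def highlight_score(features, w, v):
    """Score a segment with a linear head v (hypothetical) on the pooled feature."""
    pooled, _ = attention_pool(features, w)
    return sum(p * q for p, q in zip(pooled, v))

def pairwise_ranking_loss(s_pos, s_neg, margin=1.0):
    """Hinge loss pushing a highlight score above a non-highlight score."""
    return max(0.0, margin - s_pos + s_neg)

# Toy segment: 2 frames, 1 x 2 regions per frame, 2-D features.
features = [[[[1.0, 0.0], [0.0, 0.0]]],
            [[[0.0, 0.0], [0.0, 5.0]]]]
pooled, weights = attention_pool(features, w=[0.0, 1.0])
```

In this toy input the last region has the largest score under `w`, so it receives the largest attention weight and dominates the pooled feature. Because the weights are produced from the features themselves, no region-level or frame-level annotation is needed, only the segment-level ranking signal.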

Deep3DRanker: A Novel Framework for Learning to Rank 3D Models with Self-Attention in Robotic Vision

A Data-efficient Framework for Robotics Large-scale LiDAR Scene Parsing

Spot-Compose: A Framework for Open-Vocabulary Object Retrieval and Drawer Manipulation in Point Clouds

3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding

On the Efficacy of 3D Point Cloud Reinforcement Learning

A Unified Framework for 3D Point Cloud Visual Grounding

From Multi-View to Hollow-3D: Hallucinated Hollow-3D R-CNN for 3D Object Detection

Point Cloud Matters: Rethinking the Impact of Different Observation Spaces on Robot Learning

Target Recognition and Location Based on Deep Learning

A Novel CNN Architecture for Real-Time Point Cloud Recognition in Road Environment

Neighbor-Vote: Improving Monocular 3D Object Detection through Neighbor Distance Voting

Three-Dimensional Attention-Based Deep Ranking Model for Video Highlight Detection

3-D LiDAR Localization Based on Novel Nonlinear Optimization Method for Autonomous Ground Robot

Robotic picking in dense clutter via domain invariant learning from synthetic dense cluttered rendering

Deep Metric Learning with Self-Supervised Ranking

3D Hierarchical Refinement and Augmentation for Unsupervised Learning of Depth and Pose From Monocular Video

Drone Referring Localization: An Efficient Heterogeneous Spatial Feature Interaction Method For UAV Self-Localization

DeLS-3D: Deep Localization and Segmentation with a 3D Semantic Map

3D attention-driven depth acquisition for object identification

Pointly-supervised 3D Scene Parsing with Viewpoint Bottleneck