Abstract: During the last few years, several technological advances have led to an increase in the creation and consumption of audiovisual multimedia content. Users are overexposed to videos via social media, video sharing websites, and mobile phone applications. For efficient browsing, searching, and navigation across multimedia collections and repositories, e.g., for finding videos that are relevant to a particular topic or interest, this ever-increasing content should be described by informative yet concise content representations. A common solution to this problem is the construction of a brief summary of a video, which can be presented to the user instead of the full video, so that they can decide whether to watch or ignore the whole video. Such summaries are ideally more expressive than other alternatives, such as brief textual descriptions or keywords. In this work, the video summarization problem is approached as a supervised classification task, which relies on feature fusion of audio and visual data. Specifically, the goal of this work is to generate dynamic video summaries, i.e., compositions of parts of the original video, which include its most essential video segments, while preserving the original temporal sequence. This work relies on datasets annotated on a per-frame basis, wherein parts of videos are annotated as being “informative” or “noninformative”, with the latter being excluded from the produced summary. The novelties of the proposed approach are: (a) prior to classification, a transfer learning strategy is employed to extract deep features from pretrained models; these features are used as input to the classifiers, making them more intuitive and robust to objectiveness; and (b) the training dataset is augmented using other publicly available datasets.
The proposed approach is evaluated using three datasets of user-generated videos, and it is demonstrated that deep features and data augmentation improve the accuracy of video summaries based on human annotations. Moreover, the approach is domain independent, can be used on any video, and could be extended to rely on richer feature representations or to include other data modalities.
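
The pipeline outlined in the abstract (fuse audio and visual features, classify each part of the video as informative or noninformative, keep the informative parts in their original order) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fuse_features` and `summarize`, the use of simple concatenation as the fusion step, and the callable `is_informative` standing in for a trained classifier are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def fuse_features(audio: Sequence[float], visual: Sequence[float]) -> List[float]:
    """Early fusion (assumed here): concatenate the audio and visual
    feature vectors that describe the same video segment."""
    return list(audio) + list(visual)


def summarize(
    audio_feats: Sequence[Sequence[float]],
    visual_feats: Sequence[Sequence[float]],
    is_informative: Callable[[List[float]], bool],
) -> List[Tuple[int, int]]:
    """Return (start, end) segment-index ranges (end exclusive) that the
    classifier marks as informative, in their original temporal order.

    `is_informative` is a hypothetical stand-in for any trained binary
    classifier applied to the fused feature vector of one segment.
    """
    if len(audio_feats) != len(visual_feats):
        raise ValueError("audio and visual streams must have equal length")

    labels = [
        bool(is_informative(fuse_features(a, v)))
        for a, v in zip(audio_feats, visual_feats)
    ]

    # Merge consecutive informative segments into contiguous ranges.
    ranges: List[Tuple[int, int]] = []
    start = None
    for i, keep in enumerate(labels):
        if keep and start is None:
            start = i
        elif not keep and start is not None:
            ranges.append((start, i))
            start = None
    if start is not None:
        ranges.append((start, len(labels)))
    return ranges
```

Because the ranges are emitted while scanning the segments from first to last, the resulting summary preserves the temporal sequence of the original video, and noninformative segments are simply skipped.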

DA-ResNet: dual-stream ResNet with attention mechanism for classroom video summary

Creating Memorable Video Summaries That Satisfy the User's Intention for Taking the Videos.

Creating Personalized Video Summaries Via Semantic Event Detection

Memorable and Rich Video Summarization

A Human-Machine Collaborative Video Summarization Framework Using Pupillary Response Signals

An Unsupervised Video Summarization Method Based on Multimodal Representation.

Learning User Interest with Improved Triplet Deep Ranking and Web-Image Priors for Topic-Related Video Summarization.

A GAN Based Video Summarization Method with Representation Loss

Video Summarization Using Knowledge Distillation-Based Attentive Network

Unsupervised Video Summarization with a Convolutional Attentive Adversarial Network

Exploring global diverse attention via pairwise temporal relation for video summarization

Efficient video summarization through MobileNetSSD: a robust deep learning-based framework for efficient video summarization focused on objects of interest

Effective Video Summarization Using Channel Attention-Assisted Encoder–Decoder Framework

Video summarization via knowledge-aware multimodal deep networks

Deep Semantic and Attentive Network for Unsupervised Video Summarization

Unsupervised video summarization with adversarial graph-based attention network

Video Summarization Overview

A video summarization framework based on activity attention modeling using deep features for smart campus surveillance system

Convolutional Hierarchical Attention Network for Query-Focused Video Summarization.

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

Video Summarization Based on Feature Fusion and Data Augmentation