Abstract: Video memorability measures the degree to which a video is remembered by different viewers and has shown great potential in contexts such as advertising, education, and health care. Although image memorability has been studied extensively, research on video memorability is still in its early stages. Existing methods focus primarily on coarse-grained spatial feature representations and decision fusion strategies, overlooking the crucial interactions between the spatial and temporal domains. We therefore propose VMemNet, an end-to-end collaborative spatial-temporal network that incorporates targeted attention mechanisms and intermediate fusion strategies. This design enables VMemNet to capture the intricate relationships between spatial and temporal information and to uncover more memorability-related elements in the visual features of videos. VMemNet integrates spatial and semantically guided attention modules into a dual-stream network architecture, allowing it to capture static local cues and dynamic global cues simultaneously. Specifically, the spatial attention module aggregates the more memorable elements across spatial locations, and the semantically guided attention module performs semantic alignment and intermediate fusion of the local and global cues. In addition, two types of loss functions with complementary decision rules are associated with the corresponding attention modules to guide training. Experimental results on a publicly available dataset verify that VMemNet outperforms all current single- and multi-modal methods in video memorability prediction.
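
As a rough illustration of the spatial attention idea described in the abstract, the sketch below pools per-location features using softmax weights, so that locations with higher scores contribute more to the aggregated descriptor. It is not the VMemNet implementation: the linear scoring vector `w`, the function names, and the toy inputs are all assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention_pool(features, w):
    """Aggregate N location vectors into one vector with attention weights.

    features: list of N feature vectors, one per spatial location (length C each).
    w: hypothetical scoring vector of length C (stands in for learned parameters).
    Returns (pooled_vector, attention_weights).
    """
    # Score each spatial location with a dot product against w.
    scores = [sum(f_c * w_c for f_c, w_c in zip(f, w)) for f in features]
    # Normalise the scores so the weights over locations sum to 1.
    weights = softmax(scores)
    # Weighted sum of the location vectors, channel by channel.
    channels = len(w)
    pooled = [
        sum(weights[n] * features[n][c] for n in range(len(features)))
        for c in range(channels)
    ]
    return pooled, weights


# Toy example: two spatial locations with two channels each.
features = [[1.0, 0.0], [0.0, 1.0]]
pooled_uniform, weights_uniform = spatial_attention_pool(features, [0.0, 0.0])
pooled_peaked, weights_peaked = spatial_attention_pool(features, [10.0, 0.0])
```

With a zero scoring vector every location receives the same weight, which reduces to plain average pooling; a scoring vector that favours the first channel shifts almost all of the weight onto the first location.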

UNIMEMnet: Learning Long-Term Motion and Appearance Dynamics for Video Prediction with a Unified Memory Network

Adaptive Hierarchical Motion-Focused Model for Video Prediction

VMemNet: A Deep Collaborative Spatial-Temporal Network with Attention Representation for Video Memorability Prediction

A novel spatio-temporal memory network for video anomaly detection

Motion-Aware Feature Enhancement Network for Video Prediction

Fine-grained Key-Value Memory Enhanced Predictor for Video Representation Learning

Video Prediction Recalling Long-term Motion Context via Memory Alignment Learning

From Single to Multiple: Leveraging Multi-level Prediction Spaces for Video Forecasting

Multiscale Spatial and Temporal Learning for Human Motion Prediction

MAU: A Motion-Aware Unit for Video Prediction and Beyond

Video Frame Prediction by Deep Multi-Branch Mask Network

STAM: A SpatioTemporal Attention Based Memory for Video Prediction

Unsupervised Learning of Long-Term Motion Dynamics for Videos

MMVP: Motion-Matrix-based Video Prediction

Integrated Multiscale Appearance Features and Motion Information Prediction Network for Anomaly Detection

Exploring and Exploiting High-Order Spatial-Temporal Dynamics for Long-Term Frame Prediction

A lightweight multi-granularity asymmetric motion mode video frame prediction algorithm

Video Frame Prediction with Dual-Stream Deep Network Emphasizing Motions and Content Details

Motion Graph Unleashed: A Novel Approach to Video Prediction

ASTM - an Attention Based Spatiotemporal Model for Video Prediction Using 3D Convolutional Neural Networks