Abstract: Video memorability measures the degree to which a video is remembered by different viewers and has shown great potential in various contexts, including advertising, education, and health care. While image memorability has been studied extensively, research on video memorability is still in its early stages. Existing methods in this field primarily focus on coarse-grained spatial feature representation and decision fusion strategies, overlooking the crucial interactions between the spatial and temporal domains. To address this, we propose an end-to-end collaborative spatial-temporal network called VMemNet, which incorporates targeted attention mechanisms and intermediate fusion strategies. This enables VMemNet to capture the intricate relationships between spatial and temporal information and to uncover more elements of memorability within video visual features. VMemNet integrates spatial and semantically guided attention modules into a dual-stream network architecture, allowing it to simultaneously capture static local cues and dynamic global cues in videos. Specifically, the spatial attention module aggregates the more memorable elements across spatial locations, and the semantically guided attention module performs semantic alignment and intermediate fusion of the local and global cues. In addition, two types of loss functions with complementary decision rules are associated with the corresponding attention modules to guide the training of the proposed network. Experimental results obtained on a publicly available dataset verify that the proposed VMemNet approach outperforms all current single- and multi-modal methods in terms of video memorability prediction.
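To make the idea of aggregating features across spatial locations more concrete, the following is a minimal sketch of generic soft attention pooling in plain Python. It is not the VMemNet spatial attention module, whose details are not given in the abstract. The function names, the toy feature layout, and the fixed query vector are all assumptions made for illustration; in a trained network the scoring would be learned.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def location_scores(features: List[List[float]], query: List[float]) -> List[float]:
    """Score each spatial location by its dot product with the query vector."""
    return [sum(f * q for f, q in zip(vec, query)) for vec in features]


def spatial_attention_pool(features: List[List[float]], query: List[float]) -> List[float]:
    """Aggregate per-location feature vectors into a single vector.

    features: one feature vector per spatial location (hypothetical layout).
    query:    scoring vector; learned in a real network, fixed here (assumption).
    """
    weights = softmax(location_scores(features, query))
    dim = len(features[0])
    # Weighted sum over locations: higher-scoring locations contribute more.
    return [sum(w * vec[d] for w, vec in zip(weights, features)) for d in range(dim)]


# Toy frame with three spatial locations and two channels (illustrative values).
frame_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [2.0, 2.0]

attention = softmax(location_scores(frame_features, query_vector))
pooled = spatial_attention_pool(frame_features, query_vector)
```

In this toy input the third location receives the largest attention weight because its feature vector aligns best with the query, so it dominates the pooled representation. The semantically guided attention and the two loss functions mentioned in the abstract are not sketched here, since the abstract does not specify them.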

Fine-grained Key-Value Memory Enhanced Predictor for Video Representation Learning

Learning Quality-aware Dynamic Memory for Video Object Segmentation

Fast Real-Time Video Object Segmentation with a Tangled Memory Network

Memory-augmented Dense Predictive Coding for Video Representation Learning

VMemNet: A Deep Collaborative Spatial-Temporal Network with Attention Representation for Video Memorability Prediction

UNIMEMnet: Learning Long-Term Motion and Appearance Dynamics for Video Prediction with a Unified Memory Network

A novel spatio-temporal memory network for video anomaly detection

Adaptive Hierarchical Motion-Focused Model for Video Prediction

Learning effective feature representation for video object segmentation via memory

Masked Motion Encoding for Self-Supervised Video Representation Learning

Video Prediction Recalling Long-term Motion Context via Memory Alignment Learning

Enhancing Motion Visual Cues for Self-Supervised Video Representation Learning

Adaptive Focus for Efficient Video Recognition

XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model

Memory enhanced global-local aggregation for video object detection

STAM: A SpatioTemporal Attention Based Memory for Video Prediction

An End-to-End Future Frame Prediction Method for Vehicle-Centric Driving Videos

Feature Augmented Memory with Global Attention Network for VideoQA

Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization