Abstract: Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, relatively little effort has gone into understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during dynamic scene free-viewing, which is a long-standing need in this field. DHF1K consists of 1K high-quality, carefully selected video sequences annotated by 17 observers using an eye tracker. The videos span a wide range of scenes, motions, object types and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), that augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, thus allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF Sports. An attribute-based analysis of previous saliency models and a cross-dataset generalization study are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms other contenders and has a fast processing speed (40 fps using a single GPU). Our code and all the results are available at https://github.com/wenguanwang/DHF1K.
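The division of labor described in the abstract (a static attention map enhances per-frame features, and a recurrent stage then models change across frames) can be illustrated with a toy sketch. This is not the ACLNet implementation: the residual gating form `(1 + a) * f`, the moving-average recurrence standing in for the LSTM, and every function name below are assumptions made purely for illustration.

```python
from typing import List, Optional

Frame = List[List[float]]  # one 2D feature map (H x W)


def apply_attention(features: Frame, attention: Frame) -> Frame:
    """Enhance a feature map with a static-saliency attention map.

    Assumption: residual gating, out = (1 + a) * f, so an attention
    value of zero leaves the feature unchanged instead of erasing it.
    """
    return [
        [f * (1.0 + a) for f, a in zip(f_row, a_row)]
        for f_row, a_row in zip(features, attention)
    ]


def temporal_smooth(frames: List[Frame], keep: float = 0.5) -> List[Frame]:
    """Toy recurrent stand-in for the LSTM (hypothetical, not an LSTM):
    h_t = keep * h_{t-1} + (1 - keep) * x_t, with h_0 = x_0.
    """
    outputs: List[Frame] = []
    state: Optional[Frame] = None
    for frame in frames:
        if state is None:
            state = [row[:] for row in frame]
        else:
            state = [
                [keep * h + (1.0 - keep) * x for h, x in zip(h_row, x_row)]
                for h_row, x_row in zip(state, frame)
            ]
        outputs.append(state)
    return outputs


def predict_sequence(
    feature_seq: List[Frame], attention_seq: List[Frame], keep: float = 0.5
) -> List[Frame]:
    """Gate each frame with its attention map, then aggregate over time."""
    attended = [apply_attention(f, a) for f, a in zip(feature_seq, attention_seq)]
    return temporal_smooth(attended, keep)


if __name__ == "__main__":
    feats = [[[1.0, 2.0]], [[3.0, 4.0]]]   # two frames, each 1 x 2
    attn = [[[0.0, 1.0]], [[0.5, 0.0]]]    # matching attention maps
    print(predict_sequence(feats, attn))
```

In this sketch the spatial emphasis is computed independently for each frame, so the recurrent stage only has to combine already-attended features over time, which mirrors the motivation given in the abstract.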

Hybrid Attention Spatial-Temporal Network for Video Saliency Prediction

Spatio-Temporal Self-Attention Network for Video Saliency Prediction

Learning Stereoscopic Visual Attention Model for 3D Video

Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network

Video Saliency Prediction using Spatiotemporal Residual Attentive Networks

Video Saliency Forecasting Transformer

Audio-visual saliency prediction with multisensory perception and integration

CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective

Spectral-Spatial-Temporal Attention Network for Hyperspectral Tracking

STAM: A SpatioTemporal Attention Based Memory for Video Prediction

Human Vision Attention Mechanism-Inspired Temporal-Spatial Feature Pyramid for Video Saliency Detection

End-to-End Video Saliency Detection Via a Deep Contextual Spatiotemporal Network

Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM

Predicting 360° Video Saliency: A ConvLSTM Encoder-Decoder Network with Spatio-temporal Consistency

Video Saliency Detection via Dynamic Consistent Spatio-Temporal Attention Modelling

Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding

Revisiting Video Saliency Prediction in the Deep Learning Era

GASP: Gated Attention For Saliency Prediction

A Spatial–Channel–Temporal-Fused Attention for Spiking Neural Networks

A Gated Fusion Network for Dynamic Saliency Prediction

Spatial and Temporal Visual Attention Prediction in Videos Using Eye Movement Data