Abstract: Predicting where people look in static scenes, a.k.a. visual saliency, has received significant research interest recently. However, relatively little effort has been spent on understanding and modeling visual attention over dynamic scenes. This work makes three contributions to video saliency research. First, we introduce a new benchmark, called DHF1K (Dynamic Human Fixation 1K), for predicting fixations during dynamic scene free-viewing, which is a long-standing need in this field. DHF1K consists of 1K high-quality, elaborately selected video sequences annotated by 17 observers using an eye tracker. The videos span a wide range of scenes, motions, object types and backgrounds. Second, we propose a novel video saliency model, called ACLNet (Attentive CNN-LSTM Network), that augments the CNN-LSTM architecture with a supervised attention mechanism to enable fast end-to-end saliency learning. The attention mechanism explicitly encodes static saliency information, thus allowing the LSTM to focus on learning a more flexible temporal saliency representation across successive frames. Such a design fully leverages existing large-scale static fixation datasets, avoids overfitting, and significantly improves training efficiency and testing performance. Third, we perform an extensive evaluation of state-of-the-art saliency models on three datasets: DHF1K, Hollywood-2, and UCF Sports. An attribute-based analysis of previous saliency models and a cross-dataset generalization study are also presented. Experimental results over more than 1.2K testing videos containing 400K frames demonstrate that ACLNet outperforms other contenders and has a fast processing speed (40 fps using a single GPU). Our code and all the results are available at https://github.com/wenguanwang/DHF1K.

A Dataset and Evaluation Methodology for Visual Saliency in Video

Video Saliency Detection Using Motion Saliency Filter

Spatio-temporal salience based video quality assessment

Revisiting Video Saliency: A Large-scale Benchmark and a New Model

Shifting More Attention to Video Salient Object Detection

Benchmark 3D eye-tracking dataset for visual saliency prediction on stereoscopic 3D video

Probabilistic Multi-Task Learning for Visual Saliency Estimation in Video

A saliency dataset for 360-degree videos

Revisiting Video Saliency Prediction in the Deep Learning Era

A New Method For Spatiotemporal Textual Saliency Detection In Video

Video Saliency Detection via Dynamic Consistent Spatio-Temporal Attention Modelling

Review of Visual Saliency Detection with Comprehensive Information

Unified Image and Video Saliency Modeling

Deep Learning for Video Saliency Detection

A Unified RGB-T Saliency Detection Benchmark: Dataset, Baselines, Analysis and A Novel Approach

AIM 2024 Challenge on Video Saliency Prediction: Methods and Results

A Locally Weighted Fixation Density-Based Metric for Assessing the Quality of Visual Saliency Predictions

Finding Visual Saliency in Continuous Spike Stream

Video-based Salient Object Detection Via Spatio-Temporal Difference and Coherence

An Object-Oriented Visual Saliency Detection Framework Based on Sparse Coding Representations

Audio-visual Saliency Prediction for Movie Viewing in Immersive Environments: Dataset and Benchmarks