Abstract:Image saliency detection, to which much effort has been devoted in recent years, has advanced significantly. In contrast, the community has paid little attention to video saliency detection. Especially, existing video saliency models are very likely to fail in videos with difficult scenarios such as fast motion, dynamic background, and nonrigid deformation. Furthermore, performing video saliency detection directly using image saliency models that ignore video temporal information is inappropriate. To alleviate this issue, this study proposes a novel end-to-end spatiotemporal integration network (STI-Net) for detecting salient objects in videos. Specifically, our method is made up of three key steps: feature aggregation, saliency prediction, and saliency fusion, which are used sequentially to generate spatiotemporal deep feature maps, coarse saliency predictions, and the final saliency map. The key advantage of our model lies in the comprehensive exploration of spatial and temporal information across the entire network, where the two features interact with each other in the feature aggregation step, are used to construct boundary cue in the saliency prediction step, and also serve as the original information in the saliency fusion step. As a result, the generated spatiotemporal deep feature maps can precisely and completely characterize the salient objects, and the coarse saliency predictions have well-defined boundaries, effectively improving the final saliency map's quality. Furthermore, “shortcut connections” are introduced into our model to make the proposed network easy to train and obtain accurate results when the network is deep. Extensive experimental results on two publicly available challenging video datasets demonstrate the effectiveness of the proposed model, which achieves comparable performance to state-of-the-art saliency models.

DevsNet: Deep Video Saliency Network Using Short-term and Long-term Cues

TSMSAN: A Three-Stream Multi-Scale Attentive Network for Video Saliency Detection.

Spatio-Temporal Self-Attention Network for Video Saliency Prediction

Video Saliency Prediction Based on Spatial-Temporal Two-Stream Network

Deep3DSaliency: Deep Stereoscopic Video Saliency Detection Model by 3D Convolutional Networks.

Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network

Deep Learning for Video Saliency Detection.

Video Saliency Estimation via Encoding Deep Spatiotemporal Saliency Cues

STI-Net: Spatiotemporal integration network for video saliency detection.

Video Saliency Detection Using Deep Convolutional Neural Networks.

DeepVS2.0: A Saliency-Structured Deep Learning Method for Predicting Dynamic Visual Attention

End-to-End Video Saliency Detection Via a Deep Contextual Spatiotemporal Network

Video Salient Object Detection via Fully Convolutional Networks

Video Saliency Prediction Via Spatio-Temporal Reasoning

A Novel Spatio-Temporal 3D Convolutional Encoder-Decoder Network for Dynamic Saliency Prediction

DeepVS: A Deep Learning Based Video Saliency Prediction Approach

Video Saliency Prediction using Spatiotemporal Residual Attentive Networks.

Specificity-preserving RGB-D Saliency Detection

A Spatial-Temporal Recurrent Neural Network for Video Saliency Prediction.

CSA-Net: Deep Cross-Complementary Self Attention and Modality-Specific Preservation for Saliency Detection

Video Salient Object Detection Network with Bidirectional Memory and Spatiotemporal Constraints.