Abstract: Conventional video saliency detection methods typically follow a bottom-up approach and estimate video saliency over short-term windows only. As a result, such methods cannot avoid the persistent accumulation of errors when the collected low-level cues are repeatedly ill-detected. We also observe that some video frames that are not temporally adjacent to the current frame may nevertheless benefit saliency detection in the current frame. We therefore propose to solve this problem with a newly designed key frame strategy (KFS), whose core rationale is to use both the spatial-temporal coherency of the salient foregrounds and the objectness prior (i.e., how likely it is for an object proposal to contain an object of any class) to reveal valuable long-term information. This long-term information then guides our subsequent "self-paced" saliency diffusion, which lets each key frame determine its own diffusion range and diffusion strength to correct ill-detected video frames. At the algorithmic level, we first divide a video sequence into short-term frame batches and obtain object proposals in a frame-wise manner. Then, for each object proposal, we use a pre-trained deep saliency model to extract high-dimensional features that represent the spatial contrast. Because contrast computed across multiple neighboring video frames (i.e., in a non-local manner) is relatively insensitive to appearance variation, object proposals with high-quality low-level saliency estimates frequently exhibit strong similarity over time. Next, the long-term common consistency (e.g., appearance models and movement patterns) of the salient foregrounds is explicitly revealed via similarity analysis. Finally, we further boost detection accuracy via saliency diffusion guided by this long-term information in a self-paced manner.
We have conducted extensive experiments comparing our method with 16 state-of-the-art methods on the 4 largest publicly available benchmarks, and all results demonstrate the superiority of our method in terms of both accuracy and robustness.
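
To make the pipeline described in the abstract more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes one feature vector and one objectness score per frame, scores each frame by objectness times its mean cosine similarity to the other frames, picks the top-scoring frames as key frames, and lets each key frame set its own diffusion range and strength from that score. The function names (`select_key_frames`, `self_paced_diffusion`), the parameters `num_keys` and `max_range`, and the linear distance decay are all hypothetical choices made for illustration.

```python
import numpy as np


def select_key_frames(features, objectness, num_keys=2):
    """Pick key frames by objectness times long-term feature consistency.

    features:   (n, d) array, one (assumed) proposal feature per frame.
    objectness: (n,) array of objectness scores in [0, 1].
    Returns the sorted key-frame indices and the per-frame scores.
    """
    n = len(features)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    sim = unit @ unit.T                      # pairwise cosine similarity
    np.fill_diagonal(sim, 0.0)               # ignore self-similarity
    consistency = sim.sum(axis=1) / max(n - 1, 1)
    scores = objectness * consistency
    keys = np.argsort(scores)[::-1][:num_keys]
    return np.sort(keys), scores


def self_paced_diffusion(saliency, keys, scores, max_range=4):
    """Blend each frame's saliency map with those of nearby key frames.

    Each key frame derives its own range and strength from its score
    (an illustrative stand-in for "self-paced" diffusion).
    saliency: (n, H, W) array of per-frame saliency maps in [0, 1].
    """
    n = len(saliency)
    out = saliency.astype(float).copy()
    weight = np.ones(n)
    for k in keys:
        confidence = float(np.clip(scores[k], 0.0, 1.0))
        reach = max(1, int(round(confidence * max_range)))
        for t in range(max(0, k - reach), min(n, k + reach + 1)):
            if t == k:
                continue
            # Strength decays linearly with temporal distance to the key frame.
            strength = confidence * (1.0 - abs(t - k) / (reach + 1))
            out[t] += strength * saliency[k]
            weight[t] += strength
    # Normalize so every output map is a convex combination of input maps.
    return out / weight.reshape((n,) + (1,) * (saliency.ndim - 1))


if __name__ == "__main__":
    # Six frames; frame 3 is an ill-detected outlier.
    feats = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [1, 0], [1, 0]], float)
    objness = np.array([0.8, 0.8, 0.95, 0.2, 0.9, 0.8])
    sal = np.full((6, 1, 1), 0.8)
    sal[3] = 0.1
    keys, scores = select_key_frames(feats, objness)
    refined = self_paced_diffusion(sal, keys, scores)
    print("key frames:", keys.tolist())
    print("frame 3 before/after:", sal[3, 0, 0], refined[3, 0, 0])
```

In this toy setting the outlier frame is never chosen as a key frame, and its saliency is pulled toward that of the neighboring key frames, while frames that already agree with the key frames are left essentially unchanged.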

DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion

Lavs - A Lightweight Audio-Visual Saliency Prediction Model.

Contrastive Conditional Latent Diffusion for Audio-visual Segmentation

CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective

Deep Audio-Visual Fusion Neural Network for Saliency Estimation.

Relevance-guided Audio Visual Fusion for Video Saliency Prediction

Unified Audio-Visual Saliency Model for Omnidirectional Videos with Spatial Audio

Video Saliency Detection Via Spatial-Temporal Fusion and Low-Rank Coherency Diffusion

Structure-Aware Adaptive Diffusion for Video Saliency Detection

Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos

Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model

A Multimodal Saliency Model For Videos With High Audio-Visual Correspondence

From Discrete Representation to Continuous Modeling: A Novel Audio-Visual Saliency Prediction Model with Implicit Neural Representations

Audio-visual saliency prediction with multisensory perception and integration

Dual Domain-Adversarial Learning for Audio-Visual Saliency Prediction

A Novel Lightweight Audio-visual Saliency Model for Videos

Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models

Learning Coupled Convolutional Networks Fusion for Video Saliency Prediction

Accurate and Robust Video Saliency Detection Via Self-Paced Diffusion.

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation