Abstract: Object-level audiovisual saliency detection in 360° panoramic real-life dynamic scenes is important for exploring and modeling human perception in immersive environments, and for aiding the development of virtual, augmented, and mixed reality applications in fields such as education, social networking, entertainment, and training. To this end, we propose a new task, panoramic audiovisual salient object detection (PAV-SOD), which aims to segment the objects that attract the most human attention in 360° panoramic videos reflecting real-life daily scenes. To support the task, we collect PAVS10K, the first panoramic video dataset for audiovisual salient object detection. It consists of 67 4K-resolution equirectangular videos with per-video labels, including hierarchical scene categories and associated attributes depicting specific challenges for conducting PAV-SOD, and 10,465 uniformly sampled video frames with manually annotated object-level and instance-level pixel-wise masks. The coarse-to-fine annotations enable multi-perspective analysis of PAV-SOD modeling. We further systematically benchmark 13 state-of-the-art salient object detection (SOD) and video object segmentation (VOS) methods on PAVS10K. In addition, we propose a new baseline network that takes advantage of both the visual and audio cues of 360° video frames by using a new conditional variational auto-encoder (CVAE). Our CVAE-based audiovisual network, namely CAV-Net, consists of a spatial-temporal visual segmentation network, a convolutional audio-encoding network, and audiovisual distribution estimation modules. CAV-Net outperforms all competing models and is able to estimate the aleatoric uncertainties within PAVS10K. From extensive experimental results, we report several findings about PAV-SOD challenges and insights into PAV-SOD model interpretability. We hope that our work can serve as a starting point for advancing SOD towards immersive media.

Audio-Visual Saliency for Omnidirectional Videos

How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model

Towards Audio-Visual Saliency Prediction for Omnidirectional Video with Spatial Audio

Unified Audio-Visual Saliency Model for Omnidirectional Videos with Spatial Audio

How Sound Affects Visual Attention in Omnidirectional Videos

Perceptual Quality Assessment of Omnidirectional Audio-visual Signals

Audio-visual Aligned Saliency Model for Omnidirectional Video with Implicit Neural Representation Learning

SVGC-AVA: 360-Degree Video Saliency Prediction with Spherical Vector-Based Graph Convolution and Audio-Visual Attention

PAV-SOD: A New Task Towards Panoramic Audiovisual Saliency Detection

A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!

ASOD60K: An Audio-Induced Salient Object Detection Dataset for Panoramic Videos

Panoramic Video Salient Object Detection with Ambisonic Audio Guidance

Audio-visual Saliency Prediction for Movie Viewing in Immersive Environments: Dataset and Benchmarks

Shifting More Attention to Video Salient Object Detection

SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context

Lavs - A Lightweight Audio-Visual Saliency Prediction Model

Instance-Level Panoramic Audio-Visual Saliency Detection and Ranking

Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model

Audio-visual saliency prediction with multisensory perception and integration

A Novel Lightweight Audio-visual Saliency Model for Videos

Saliency Prediction of Sports Videos: A Large-Scale Database and a Self-Adaptive Approach