Abstract: Object-level audiovisual saliency detection in 360° panoramic real-life dynamic scenes is important for exploring and modeling human perception in immersive environments, and for aiding the development of virtual, augmented, and mixed reality applications in fields such as education, social networking, entertainment, and training. To this end, we propose a new task, panoramic audiovisual salient object detection (PAV-SOD), which aims to segment the objects that attract the most human attention in 360° panoramic videos reflecting real-life daily scenes. To support the task, we collect PAVS10K, the first panoramic video dataset for audiovisual salient object detection. It consists of 67 4K-resolution equirectangular videos with per-video labels, including hierarchical scene categories and associated attributes depicting specific challenges for conducting PAV-SOD, and 10,465 uniformly sampled video frames with manually annotated object-level and instance-level pixel-wise masks. The coarse-to-fine annotations enable multi-perspective analysis of PAV-SOD modeling. We further systematically benchmark 13 state-of-the-art salient object detection (SOD)/video object segmentation (VOS) methods on PAVS10K. In addition, we propose a new baseline network that takes advantage of both the visual and audio cues of 360° video frames by using a new conditional variational auto-encoder (CVAE). Our CVAE-based audiovisual network, namely CAV-Net, consists of a spatial-temporal visual segmentation network, a convolutional audio-encoding network, and audiovisual distribution estimation modules. CAV-Net outperforms all competing models and is able to estimate the aleatoric uncertainties within PAVS10K. From extensive experimental results, we report several findings about PAV-SOD challenges and insights towards PAV-SOD model interpretability.
We hope that our work could serve as a starting point for advancing SOD towards immersive media.