Instance-Level Panoramic Audio-Visual Saliency Detection and Ranking

Ruohao Guo, Dantong Niu, Liao Qu, Yanyu Qi, Ji Shi, Wenzhen Yue, Bowei Xing, Taiyan Chen, Xianghua Ying
DOI: https://doi.org/10.1145/3664647.3681070
2024-01-01
Abstract: Panoramic audio-visual saliency detection aims to segment the most attention-attracting regions in 360° panoramic videos with sound. To meticulously delineate the detected salient regions and effectively model human attention shift, we extend this task to more fine-grained instance-level scenarios: identifying salient object instances and inferring their saliency ranks. In this paper, we propose the first instance-level framework that can be applied simultaneously to the segmentation and ranking of multiple salient objects in panoramic videos. Specifically, it consists of a distortion-aware pixel decoder to overcome panoramic distortions, a sequential audio-visual fusion module to integrate audio-visual information, and a spatio-temporal object decoder to separate individual instances and predict their saliency scores. Moreover, because no such annotations exist, we create ground-truth saliency ranks for the PAVS10K benchmark. Extensive experiments demonstrate that our model achieves state-of-the-art performance on PAVS10K for both the saliency detection and ranking tasks. The code is available at https://github.com/ruohaoguo/pavsodr.
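As a minimal illustration of the ranking step only, the sketch below assumes each detected instance carries a scalar saliency score and that a higher score means a more salient object. The helper name `rank_instances` and the tie-breaking rule are hypothetical and are not taken from the paper or its repository.

```python
from typing import List


def rank_instances(scores: List[float]) -> List[int]:
    """Return a 1-based saliency rank per instance (1 = most salient).

    Hypothetical helper: assumes a higher score means a more salient
    instance; ties keep the original instance order.
    """
    # Sort instance indices by score, highest first. sorted() is stable,
    # so instances with equal scores stay in input order.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])

    # Write each instance's rank back at its original position.
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks
```

For example, calling `rank_instances` on the scores of three instances returns one rank per instance, in the same order as the input scores.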