ASOD60K: An Audio-Induced Salient Object Detection Dataset for Panoramic Videos
Yi Zhang
DOI: https://doi.org/10.48550/arXiv.2107.11629
2021-11-12
Abstract:Exploring to what humans pay attention in dynamic panoramic scenes is useful for many fundamental applications, including augmented reality (AR) in retail, AR-powered recruitment, and visual language navigation. With this goal in mind, we propose PV-SOD, a new task that aims to segment salient objects from panoramic videos. In contrast to existing fixation-/object-level saliency detection tasks, we focus on audio-induced salient object detection (SOD), where the salient objects are labeled with the guidance of audio-induced eye movements. To support this task, we collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy, thus distinguishing itself with richness, diversity and quality. Specifically, each sequence is marked with both its super-/sub-class, with objects of each sub-class being further annotated with human eye fixations, bounding boxes, object-/instance-level masks, and associated attributes (e.g., geometrical distortion). These coarse-to-fine annotations enable detailed analysis for PV-SOD modelling, e.g., determining the major challenges for existing SOD models, and predicting scanpaths to study the long-term eye fixation behaviors of humans. We systematically benchmark 11 representative approaches on ASOD60K and derive several interesting findings. We hope this study could serve as a good starting point for advancing SOD research towards panoramic videos. The dataset and benchmark will be made publicly available at <a class="link-external link-https" href="https://github.com/PanoAsh/ASOD60K" rel="external noopener nofollow">this https URL</a>.
Computer Vision and Pattern Recognition