Audio-visual Saliency Prediction for Movie Viewing in Immersive Environments: Dataset and Benchmarks

Zhao Chen, Kao Zhang, Hao Cai, Xiaoying Ding, Chenxi Jiang, Zhenzhong Chen
DOI: https://doi.org/10.1016/j.jvcir.2024.104095
IF: 2.887
2024-01-01
Journal of Visual Communication and Image Representation
Abstract: In this paper, an eye-tracking dataset of movie viewing in an immersive environment is developed. It contains 256 movie clips at 2K QHD resolution with corresponding movie genre labels from IMDb (Internet Movie Database). The dataset provides audio-visual cues for studying human visual attention when watching movies with a VR headset, with eye movements recorded by an integrated eye tracker. To provide benchmarks for saliency prediction in this setting, fifteen computational models are evaluated on the dataset, including a newly developed multi-stream audio-visual saliency prediction model based on deep neural networks, named MSAV. Detailed quantitative and qualitative comparisons and analyses are also provided. The dataset and benchmarks could help facilitate studies of visual saliency prediction for movie viewing in immersive environments.