Unified Audio-Visual Saliency Model for Omnidirectional Videos with Spatial Audio

Dandan Zhu,Kaiwei Zhang,Nana Zhang,Qiangqiang Zhou,Xiongkuo Min,Guangtao Zhai,Xiaokang Yang
DOI: https://doi.org/10.1109/tmm.2023.3271022
IF: 7.3
2023-01-01
IEEE Transactions on Multimedia
Abstract:Spatial audio is a crucial component of omnidirectional videos (ODVs), which can provide an immersive experience by enabling viewers to perceive sound sources in all directions. However, most visual attention modeling works for ODVs focus only on visual cues, and audio modality is rather rarely considered. Additionally, the existing audio-visual saliency models for ODVs lack spatial audio location-awareness (i.e. sound source location-agnostic) and audio content attributes discriminability (i.e. audio content attributes-agnostic). To this end, we propose a novel audio-visual perception saliency (AVPS) model with spatial audio location-awareness and audio content attributes-adaptive to efficiently address the problem of fixation prediction in ODVs. Specifically, we first utilize the improved group equivariant convolutional neural network (G-CNN) with eidetic 3D LSTM (E3D-LSTM) to extract spatial-temporal visual features. Then we perceive sound source locations by computing the audio energy map (AEM) of the audio information in ODVs. Subsequently, we introduce SoundNet to extract audio features with multiple attributes. Finally, we develop an audio-visual feature fusion module to adaptively integrate spatial-temporal visual features and spatial auditory information to generate the final audio-visual saliency map. Extensive experiments in three audio modalities validate the effectiveness of the proposed model. Meanwhile, the performance of the proposed model is superior to the other 10 state-of-the-art saliency models.
What problem does this paper attempt to address?