From Discrete Representation to Continuous Modeling: A Novel Audio-Visual Saliency Prediction Model with Implicit Neural Representations

Dandan Zhu, Kaiwei Zhang, Kun Zhu, Nana Zhang, Weiping Ding, Guangtao Zhai, Xiaokang Yang
DOI: https://doi.org/10.1109/tetci.2024.3386619
2024-01-01
IEEE Transactions on Emerging Topics in Computational Intelligence
Abstract: In the era of deep learning, audio-visual saliency prediction is still in its infancy due to the complexity of video signals and the continuous correlation in the temporal dimension. Most existing approaches treat videos as 3D grids of RGB values and model them using discrete neural networks, which leads to content-agnostic modeling and sub-optimal feature representation. To address these challenges, we propose a novel dynamic-aware audio-visual saliency (DAVS) model based on implicit neural representations (INRs). The core of the DAVS model is a parametric neural network that maps space-time coordinates to the corresponding saliency values. Specifically, the model incorporates an INR-based video generator that decomposes videos into image, motion, and audio feature vectors and learns video content-adaptive features via a parametric neural network. This generator efficiently encodes videos, naturally models continuous temporal dynamics, and enhances feature representation capability. Furthermore, we introduce a parametric audio-visual feature fusion strategy in the saliency prediction procedure, which enables intrinsic interactions between modalities and adaptively integrates visual and audio cues. In extensive experiments on benchmark datasets, the proposed DAVS model demonstrates promising performance and intriguing properties in audio-visual saliency prediction.
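The central idea in the abstract, a parametric network that maps space-time coordinates to saliency values, can be illustrated with a minimal sketch. This is not the DAVS architecture: the layer sizes, the sinusoidal activation, the random weights, and the names `init_mlp`, `saliency_inr`, and `query_frame` are all assumptions made for illustration, and the sketch omits the image, motion, and audio feature vectors that the paper conditions on.

```python
import numpy as np

def init_mlp(sizes, seed=0):
    """Random (untrained) weights for a small coordinate MLP; sizes are illustrative."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def saliency_inr(coords, params):
    """Map an (N, 3) array of (x, y, t) coordinates to (N, 1) saliency values in [0, 1]."""
    h = coords
    for W, b in params[:-1]:
        h = np.sin(h @ W + b)              # sinusoidal activation (an assumed choice)
    W, b = params[-1]
    logits = h @ W + b
    return 1.0 / (1.0 + np.exp(-logits))   # sigmoid keeps outputs in [0, 1]

def query_frame(params, t, height, width):
    """Evaluate the same network on a pixel grid of any size at any real-valued time t."""
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, height),
                         np.linspace(-1.0, 1.0, width), indexing="ij")
    ts = np.full_like(xs, t)
    coords = np.stack([xs, ys, ts], axis=-1).reshape(-1, 3)
    return saliency_inr(coords, params).reshape(height, width)

params = init_mlp([3, 64, 64, 1])
# The time and resolution are free query parameters, not fixed grid indices.
frame_a = query_frame(params, t=0.25, height=36, width=64)
frame_b = query_frame(params, t=0.2501, height=72, width=128)
```

Because the network is a function of continuous coordinates, the same parameters can be queried between frames or at a different spatial resolution, which is the property the abstract contrasts with discrete 3D-grid representations.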