Audio-visual Saliency Prediction Model with Implicit Neural Representation

Nana Zhang, Min Xiong, Dandan Zhu, Kun Zhu, Guangtao Zhai, Xiaokang Yang
DOI: https://doi.org/10.1145/3698881
2024-01-01
Abstract: With the rapid advancement of deep learning techniques and the wide availability of large-scale datasets, the performance of audio-visual saliency prediction has improved substantially. Nevertheless, audio-visual saliency prediction is still at an early stage of exploration because of the spatio-temporal signal complexity and dynamic continuity of video content. To our knowledge, most existing audio-visual saliency prediction approaches represent videos as a 3D grid of RGB values processed by discrete convolutional neural networks (CNNs), which makes the representation agnostic to video content and ignores dynamic continuity. This paper proposes a novel parametric audio-visual saliency (PAVS) model with implicit neural representation (INR) to address these problems. Specifically, the proposed parametric neural network encodes the space-time coordinates of video frames into the corresponding saliency values, which significantly enhances its ability to learn compact feature representations. Meanwhile, a parametric feature fusion method is developed to achieve intrinsic interactions between the audio and visual information streams, adaptively fusing audio and visual features to obtain competitive performance. Notably, without resorting to any specific audio-visual feature fusion strategy, the proposed PAVS model outperforms other state-of-the-art saliency methods by a large margin.
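As a rough illustration of the coordinate-to-saliency idea described in the abstract, the sketch below maps a normalized space-time coordinate (x, y, t) to a saliency value in (0, 1) with a tiny coordinate MLP. It is not the PAVS architecture: the Fourier-feature encoding, the layer sizes, and all names (`fourier_features`, `CoordinateMLP`, `render_frame`) are assumptions made for this example, the weights are random rather than trained, and audio conditioning is omitted entirely.

```python
import math
import random


def fourier_features(x, y, t, num_bands=4):
    """Encode a normalized (x, y, t) coordinate as sin/cos features.

    The Fourier encoding is an assumption of this sketch, not taken from the paper.
    """
    feats = []
    for coord in (x, y, t):
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi
            feats.append(math.sin(freq * coord))
            feats.append(math.cos(freq * coord))
    return feats


class CoordinateMLP:
    """Hypothetical two-layer MLP: features -> ReLU hidden layer -> sigmoid output."""

    def __init__(self, in_dim, hidden_dim=16, seed=0):
        rng = random.Random(seed)
        s1 = 1.0 / math.sqrt(in_dim)
        s2 = 1.0 / math.sqrt(hidden_dim)
        # Random, untrained weights: the outputs are only structurally meaningful.
        self.w1 = [[rng.uniform(-s1, s1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.uniform(-s2, s2) for _ in range(hidden_dim)]
        self.b2 = 0.0

    def __call__(self, feats):
        hidden = [
            max(0.0, sum(w * f for w, f in zip(row, feats)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        logit = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-logit))


def render_frame(model, t, height, width):
    """Sample the continuous saliency function on a pixel grid of any resolution."""
    return [
        [
            model(fourier_features((col + 0.5) / width, (row + 0.5) / height, t))
            for col in range(width)
        ]
        for row in range(height)
    ]


if __name__ == "__main__":
    model = CoordinateMLP(in_dim=24)  # 3 coordinates x 4 bands x (sin, cos)
    frame = render_frame(model, t=0.5, height=3, width=4)
    for row in frame:
        print(" ".join(f"{value:.3f}" for value in row))
```

Because the function is defined over continuous coordinates, the same model can be queried at any spatial resolution or at time steps between frames, which is the property that a fixed 3D grid of RGB values lacks.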