Lavs - A Lightweight Audio-Visual Saliency Prediction Model.

Dandan Zhu, Defang Zhao, Xiongkuo Min, Tian Han, Qiangqiang Zhou, Shaobo Yu, Yongqing Chen, Guangtao Zhai, Xiaokang Yang
DOI: https://doi.org/10.1109/ICME51207.2021.9428415
2021-01-01
Abstract: Audio information is essential for guiding human attention and visual perception, as many comprehensive psychological studies have verified. However, the audio modality has been largely neglected in modeling visual attention, and most current visual attention models depend heavily on visual information alone. Additionally, existing high-performing visual attention models rely on deeper convolutional neural networks (CNNs), which offer extraordinary feature learning ability but incur high computational cost. To this end, we propose a novel lightweight audio-visual saliency (LAVS) model to efficiently address the problem of fixation prediction in videos. To the best of our knowledge, the proposed model is the first attempt to exploit a lightweight network and combine visual and audio cues to perform saliency estimation in videos. Specifically, the model consists of four modules: a spatial-temporal visual saliency estimation module, an audio feature extraction module, a source sound localization module, and an audio-visual saliency fusion module. Extensive experiments across datasets validate the effectiveness and real-time performance of the proposed LAVS model, which outperforms other state-of-the-art methods.