A Novel Lightweight Audio-visual Saliency Model for Videos

Dandan Zhu,Xuan Shao,Qiangqiang Zhou,Xiongkuo Min,Guangtao Zhai,Xiaokang Yang
DOI: https://doi.org/10.1145/3576857
IF: 4.094
2022-01-01
ACM Transactions on Multimedia Computing Communications and Applications
Abstract:Audio information has not been considered an important factor in visual attention models regardless of many psychological studies that have shown the importance of audio information in the human visual perception system. Since existing visual attention models only utilize visual information, their performance is limited but also requires high-computational complexity due to the limited information available. To overcome these problems, we propose a lightweight audio-visual saliency (LAVS) model for video sequences. To the best of our knowledge, this article is the first trial to utilize audio cues for an efficient deep-learning model for the video saliency estimation. First, spatial-temporal visual features are extracted by the lightweight receptive field block (RFB) with the bidirectional ConvLSTM units. Then, audio features are extracted by using an improved lightweight environment sound classification model. Subsequently, deep canonical correlation analysis (DCCA) aims at capturing the correspondence between audio and spatial-temporal visual features, thus obtaining a spatial-temporal auditory saliency. Lastly, the spatial-temporal visual and auditory saliency are fused to obtain the audio-visual saliency map. Extensive comparative experiments and ablation studies validate the performance of the LAVS model in terms of effectiveness and complexity.
What problem does this paper attempt to address?