MTCAM: A Novel Weakly-Supervised Audio-Visual Saliency Prediction Model with Multi-Modal Transformer

Dandan Zhu, Kun Zhu, Weiping Ding, Nana Zhang, Xiongkuo Min, Guangtao Zhai, Xiaokang Yang
DOI: https://doi.org/10.1109/tetci.2024.3358184
2024-01-01
IEEE Transactions on Emerging Topics in Computational Intelligence
Abstract: Although various video saliency models have achieved considerable performance gains, existing deep learning-based audio-visual saliency prediction models are still at an early stage of exploration. The major challenge is that relatively few audio-visual sequences come with real human eye fixations collected under audio-visual conditions. To this end, this paper presents a novel multi-modal transformer-based class activation mapping (MTCAM) model, trained in a weakly-supervised manner, to alleviate the need for large-scale datasets in audio-visual saliency prediction. In particular, using only the video category labels of a video classification task, we propose to employ class activation mapping based on a multi-modal transformer, which follows a two-stage training methodology to extract the most discriminative regions. Such strongly discriminative regions are highly consistent with real human eye fixations. We further devise an efficient feature reuse mechanism that reduces redundant computation and enables previously obtained features to provide effective guidance for downstream model learning. Notably, this work is the first attempt to exploit a cross-modal transformer to capture cross-modal interaction across the entire video and to predict human eye fixations with a weakly-supervised training strategy. We conduct extensive experiments on several benchmark datasets to demonstrate that the proposed MTCAM model significantly outperforms other competitors. Furthermore, detailed ablation experiments are performed to validate the effectiveness and rationality of each component of the proposed model.
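The abstract relies on class activation mapping (CAM) to turn a classifier trained on category labels into a spatial map. As a rough illustration only, the sketch below shows the classic CAM computation on a single feature map: the channels of the last feature layer are summed, weighted by the classifier weights of one class. It is not the authors' MTCAM model and contains no transformer or audio branch; the function name `class_activation_map`, the array shapes, and the random inputs are all assumptions made for this example.

```python
import numpy as np


def class_activation_map(features, class_weights, class_idx):
    """Classic CAM on one feature map (illustrative, not MTCAM itself).

    features:      array of shape (C, H, W), last-layer feature maps.
    class_weights: array of shape (num_classes, C), weights of a linear
                   classifier applied after global average pooling.
    class_idx:     index of the class whose evidence is visualised.
    """
    # Weighted sum over channels gives one (H, W) map for the chosen class.
    cam = np.tensordot(class_weights[class_idx], features, axes=([0], [0]))
    # Keep only positive evidence for the class.
    cam = np.maximum(cam, 0.0)
    # Scale to [0, 1] so the map can be read as a saliency-like heat map.
    peak = cam.max()
    if peak > 0:
        cam = cam / peak
    return cam


# Hypothetical inputs: 8 channels on a 7x7 grid, 5 video categories.
rng = np.random.default_rng(0)
features = rng.random((8, 7, 7))
class_weights = rng.standard_normal((5, 8))

# Classification uses only pooled features, so only category labels
# would be needed to train such a classifier.
logits = class_weights @ features.mean(axis=(1, 2))
predicted = int(np.argmax(logits))
saliency = class_activation_map(features, class_weights, predicted)
```

In this simplified setting, the regions that contribute most to the predicted category receive the highest values, which is the sense in which the abstract treats discriminative regions as a proxy for eye fixations.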