Abstract: Audio-visual saliency prediction (AVSP) aims to model human attention patterns in the perception of auditory and visual scenes. Because perceiving and combining multi-modal saliency features from videos is challenging, this paper presents a multi-sensory framework for AVSP. The framework extracts audio, motion and image saliency features and integrates them effectively, so that it can serve as a general architecture for the AVSP task. To obtain multi-sensory information, we develop a three-stream encoder that extracts audio, motion and image saliency features. In particular, we use an encoder pre-trained with knowledge related to image saliency to extract saliency features for each frame. These image saliency features are then combined with the motion features through a spatial attention module. For motion features, AVSP models commonly use 3D convolutional neural networks (CNNs) such as S3D. However, these networks cannot effectively capture global motion relationships in videos. To tackle this problem, we incorporate Transformer- and MLP-based motion encoders into AVSP models. To learn joint audio-visual representations, we employ an audio-visual fusion block that enhances the correlation between audio and visual motion features, trained in a self-supervised manner under the supervision of a cosine similarity loss. Finally, a multi-stage decoder integrates the audio, motion and image saliency features to generate the final saliency map. We evaluate our method on six audio-visual eye-tracking datasets. Experimental results demonstrate that it achieves compelling performance compared with state-of-the-art methods. The source code will be available at https://github.com/oraclefina/MSPI.
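The cosine similarity loss mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so this assumes the common form in which the loss is the mean of 1 - cos(a, v) over paired audio and visual feature vectors. The function names `cosine_similarity` and `cosine_similarity_loss` and the toy inputs are hypothetical and are not taken from the released code.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_a * norm_b, eps)


def cosine_similarity_loss(audio_feats, visual_feats):
    """Assumed form: mean of (1 - cos) over paired audio/visual vectors.

    The loss is 0 when each pair points in the same direction and grows
    as the paired features become less aligned.
    """
    if len(audio_feats) != len(visual_feats):
        raise ValueError("audio and visual batches must have the same size")
    if not audio_feats:
        raise ValueError("batches must not be empty")
    sims = [cosine_similarity(a, v) for a, v in zip(audio_feats, visual_feats)]
    return sum(1.0 - s for s in sims) / len(sims)


# Toy example: one aligned pair and one orthogonal pair.
audio = [[1.0, 0.0], [1.0, 0.0]]
visual = [[2.0, 0.0], [0.0, 3.0]]
loss = cosine_similarity_loss(audio, visual)
```

Minimizing a loss of this kind pulls each audio feature toward the direction of its paired visual motion feature without requiring labels, which matches the self-supervised setting described in the abstract.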
