Abstract: Audio-visual saliency prediction (AVSP) aims to model human attention patterns in the perception of auditory and visual scenes. Because perceiving and combining multi-modal saliency features from videos is challenging, this paper presents a multi-sensory framework for AVSP. The framework extracts audio, motion and image saliency features and integrates them effectively, so that it can serve as a general architecture for the AVSP task. To obtain multi-sensory information, we develop a three-stream encoder that extracts audio, motion and image saliency features. For image saliency, we use an encoder pre-trained with knowledge related to image saliency to extract saliency features for each frame; these features are then combined with the motion features through a spatial attention module. For motion features, 3D convolutional neural networks (CNNs) such as S3D are commonly used in AVSP models. However, these networks are unable to effectively capture the global motion relationship in videos. To tackle this problem, we incorporate Transformer- and MLP-based motion encoders into the AVSP models. To learn joint audio-visual representations, an audio-visual fusion block enhances the correlation between audio and visual motion features, supervised by a cosine similarity loss in a self-supervised manner. Finally, a multi-stage decoder integrates audio, motion and image saliency features to generate the final saliency map. We evaluate our methods on six audio-visual eye-tracking datasets. Experimental results demonstrate that our method achieves compelling performance compared to state-of-the-art methods. The source code will be available at https://github.com/oraclefina/MSPI.
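The abstract mentions a cosine similarity loss that encourages audio and visual motion features to correlate, but does not give its exact form. The following is only a minimal sketch of one common formulation (one minus the cosine similarity, averaged over paired feature vectors). The function names `cosine_similarity` and `cosine_alignment_loss`, the flat-vector feature layout and the `eps` guard are assumptions made for illustration and are not taken from the paper or its repository.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], v: Sequence[float], eps: float = 1e-8) -> float:
    """Cosine of the angle between two feature vectors of equal length."""
    if len(a) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, v))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_v = math.sqrt(sum(y * y for y in v))
    # eps guards against division by zero for all-zero vectors (assumed choice).
    return dot / max(norm_a * norm_v, eps)


def cosine_alignment_loss(
    audio_feats: Sequence[Sequence[float]],
    motion_feats: Sequence[Sequence[float]],
) -> float:
    """Mean of (1 - cosine similarity) over paired audio/motion feature vectors.

    Hypothetical formulation: the loss is 0 when every pair points in the same
    direction and grows as the paired features become less aligned.
    """
    if not audio_feats or len(audio_feats) != len(motion_feats):
        raise ValueError("need the same non-zero number of audio and motion vectors")
    losses = [1.0 - cosine_similarity(a, v) for a, v in zip(audio_feats, motion_feats)]
    return sum(losses) / len(losses)
```

In this sketch each audio vector is paired with the motion vector from the same clip, so no labels are needed, which is consistent with the self-supervised setting the abstract describes. A real model would compute the same quantity on batched tensors inside the training framework.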

Saliency-Based Spatiotemporal Attention for Video Captioning

Learning Stereoscopic Visual Attention Model for 3D Video

STAT: Spatial-Temporal Attention Mechanism for Video Captioning

Semantic-Driven Saliency-Context Separation for Video Captioning

Motion Guided Spatial Attention for Video Captioning

Top-down Visual Saliency Guided by Captions

Spatio-Temporal Ranked-Attention Networks for Video Captioning

Video Saliency Prediction Using Enhanced Spatiotemporal Alignment Network

Video Saliency Detection via Dynamic Consistent Spatio-Temporal Attention Modelling

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Hybrid Attention Spatial-Temporal Network for Video Saliency Prediction

Video Captioning With Attention-Based LSTM and Semantic Consistency

A Multimodal Saliency Model For Videos With High Audio-Visual Correspondence

Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning

Adaptively Attending to Visual Attributes and Linguistic Knowledge for Captioning

Audio-visual saliency prediction with multisensory perception and integration

Motion Guided Region Message Passing for Video Captioning

Video Captioning in Compressed Video

Multimodal Video Saliency Analysis with User-Biased Information

Adaptive Spatial Location with Balanced Loss for Video Captioning