Abstract: With the development of media and networking technologies, multimedia applications ranging from feature presentation in a cinema setting to video on demand to interactive video conferencing are in great demand. Good synchronization between audio and video modalities is a key factor towards defining the quality of a multimedia presentation. The audio and visual signals of a multimedia presentation are commonly managed by independent workflows - they are often separately authored, processed, stored and even delivered to the playback system. This opens up the possibility of temporal misalignment between the two modalities - such a tendency is often more pronounced in the case of produced content (such as movies). To judge whether audio and video signals of a multimedia presentation are synchronized, we as humans often pay close attention to discriminative spatio-temporal blocks of the video (e.g. synchronizing the lip movement with the utterance of words, or the sound of a bouncing ball at the moment it hits the ground). At the same time, we ignore large portions of the video in which no discriminative sounds exist (e.g. background music playing in a movie). Inspired by this observation, we study leveraging attention modules for automatically detecting audio-visual synchronization. We propose neural network based attention modules, capable of weighting different portions (spatio-temporal blocks) of the video based on their respective discriminative power. Our experiments indicate that incorporating attention modules yields state-of-the-art results for the audio-visual synchronization classification problem.

What problem does this paper attempt to address?

This paper addresses the detection of temporal misalignment between the audio and video signals of a multimedia presentation. Demand for multimedia applications keeps growing, from feature presentations in cinemas to video-on-demand services on the Internet to real-time applications such as live video streaming and interactive video conferencing. In these applications, good synchronization between the audio and video modalities is one of the key factors defining the quality of the presentation. Because audio and video signals are usually managed by independent workflows, the two can become temporally misaligned, and this tendency is more pronounced in produced content such as movies. To address this, the authors propose an attention-module-based method for automatically detecting audio-visual synchronization. Specifically, they design neural network attention modules that weight different parts (spatio-temporal blocks) of the video according to their discriminative power. The network can then focus on the parts that are crucial for judging synchronization and ignore parts that contain no discriminative sound or are otherwise irrelevant. This method not only improves classification accuracy but also speeds up model convergence. The experimental results show that both the temporal attention module and the spatio-temporal attention module significantly improve performance on the audio-visual synchronization classification problem; on voice-category data, the spatio-temporal attention module yields a larger improvement than the temporal attention module alone. These results support the effectiveness and robustness of the proposed attention mechanism in practical applications.
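The weighting idea described above can be illustrated with a minimal sketch. It is not the authors' architecture: the function names, the toy feature vectors, and the hand-picked confidence scores are all assumptions made for illustration, and a real model would produce the scores with learned layers. The sketch only shows how a softmax over per-block scores turns into weights that let one discriminative block dominate the pooled feature.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(block_features, block_scores):
    """Pool per-block features into one vector using attention weights.

    block_features: equal-length feature vectors, one per block.
    block_scores:   one raw confidence score per block (hypothetical here;
                    a trained network would predict these).
    """
    weights = softmax(block_scores)
    dim = len(block_features[0])
    pooled = [
        sum(w * feat[i] for w, feat in zip(weights, block_features))
        for i in range(dim)
    ]
    return weights, pooled

# Three temporal blocks; only the middle one carries a discriminative cue,
# so it is given a much higher (made-up) score than its neighbours.
features = [[0.1, 0.0], [0.9, 1.0], [0.2, 0.1]]
scores = [0.0, 4.0, 0.0]
weights, pooled = attend(features, scores)
```

Here the middle block receives most of the weight, so the pooled vector stays close to that block's features, while the low-scoring blocks contribute little. A spatio-temporal variant would apply the same normalization over blocks indexed by both position and time.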

On Attention Modules for Audio-Visual Synchronization

Event-Based Multimodal Spiking Neural Network with Attention Mechanism

A Bimodal Learning Approach to Assist Multi-sensory Effects Synchronization

How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

Synchformer: Efficient Synchronization from Sparse Cues

Rethinking Audio-visual Synchronization for Active Speaker Detection

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

A Convolutional-Attentional Neural Framework for Structure-Aware Performance-Score Synchronization

Cross-Modal Attention Network for Temporal Inconsistent Audio-Visual Event Localization

Attention-Guided Neural Networks for Full-Reference and No-Reference Audio-Visual Quality Assessment

Audiovisual Dependency Attention for Violence Detection in Videos

Joint Visual and Audio Learning for Video Highlight Detection

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond

Measuring Human Perception of Audiovisual Errors Using EEG

Audiovisual Highlight Detection in Videos

Audio-Visual Speaker Verification via Joint Cross-Attention

Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning

Audio-Visual Fusion for Emotion Recognition in the Valence-Arousal Space Using Joint Cross-Attention

How Does Audio Influence Visual Attention in Omnidirectional Videos? Database and Model

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features