On Attention Modules for Audio-Visual Synchronization

Naji Khosravan, Shervin Ardeshir, Rohit Puri
DOI: https://doi.org/10.48550/arXiv.1812.06071
2018-12-15
Abstract: With the development of media and networking technologies, multimedia applications ranging from feature presentation in a cinema setting to video on demand to interactive video conferencing are in great demand. Good synchronization between audio and video modalities is a key factor towards defining the quality of a multimedia presentation. The audio and visual signals of a multimedia presentation are commonly managed by independent workflows - they are often separately authored, processed, stored and even delivered to the playback system. This opens up the possibility of temporal misalignment between the two modalities - such a tendency is often more pronounced in the case of produced content (such as movies). To judge whether audio and video signals of a multimedia presentation are synchronized, we as humans often pay close attention to discriminative spatio-temporal blocks of the video (e.g. synchronizing the lip movement with the utterance of words, or the sound of a bouncing ball at the moment it hits the ground). At the same time, we ignore large portions of the video in which no discriminative sounds exist (e.g. background music playing in a movie). Inspired by this observation, we study leveraging attention modules for automatically detecting audio-visual synchronization. We propose neural network based attention modules, capable of weighting different portions (spatio-temporal blocks) of the video based on their respective discriminative power. Our experiments indicate that incorporating attention modules yields state-of-the-art results for the audio-visual synchronization classification problem.
Computer Vision and Pattern Recognition, Machine Learning, Multimedia
What problem does this paper attempt to address?
This paper addresses the problem of detecting whether the audio and video signals of a multimedia presentation are synchronized. Demand for multimedia applications keeps growing, from feature presentations in cinemas to video-on-demand services on the Internet to real-time applications such as live video streaming and interactive video conferencing, and good synchronization between the audio and video modalities is a key factor in the quality of a presentation. Because audio and video signals are usually managed by independent workflows, the two can become temporally misaligned, and this tendency is often more pronounced in produced content such as movies. To detect such misalignment automatically, the authors propose neural-network-based attention modules that weight different parts of the video (spatio-temporal blocks) according to their discriminative power. The network can then focus on the parts that matter most for judging synchronization and down-weight parts that carry no discriminative sound or are otherwise irrelevant. The method is reported to improve classification accuracy and to speed up model convergence. The experimental results indicate that both the temporal attention module and the spatio-temporal attention module improve performance on the audio-visual synchronization classification problem; on speech-category data, the spatio-temporal attention module yields a larger improvement than the temporal attention module alone.
These results support the effectiveness and robustness of the proposed attention mechanism in practical applications.
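The idea of weighting spatio-temporal blocks by their discriminative power can be illustrated with a generic soft-attention pooling step. The sketch below is not the authors' architecture, which this summary does not specify. It assumes that each block has already been encoded as a feature vector and that a hypothetical learned vector, `score_weights`, assigns each block a scalar score. The function name `attention_pool` and the toy values are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(block_features, score_weights):
    """Soft-attention pooling over per-block feature vectors.

    block_features: list of equal-length feature vectors, one per
                    spatio-temporal block (assumed to be precomputed).
    score_weights:  hypothetical learned vector used to score each block.
    Returns (attention weights, pooled feature vector).
    """
    # Linear score per block: dot product with the scoring vector.
    scores = [sum(w * x for w, x in zip(score_weights, f))
              for f in block_features]
    # Normalize scores so the weights are positive and sum to 1.
    weights = softmax(scores)
    # Weighted sum of block features, one output value per dimension.
    dim = len(block_features[0])
    pooled = [sum(a * f[d] for a, f in zip(weights, block_features))
              for d in range(dim)]
    return weights, pooled


# Toy input: three blocks with two features each. The second block has
# the largest score, so it should receive the largest attention weight.
blocks = [[0.1, 0.0], [2.0, 1.0], [0.2, 0.1]]
score_weights = [1.0, 1.0]
weights, pooled = attention_pool(blocks, score_weights)
print(weights, pooled)
```

In this sketch the pooled vector would then feed a classifier that decides whether the clip is synchronized. Blocks with low scores, such as a segment with only background music, contribute little to that decision.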