Multi-Attention Audio-Visual Fusion Network for Audio Spatialization

Wen Zhang, Jie Shao
DOI: https://doi.org/10.1145/3460426.3463624
2021-01-01
Abstract: We encounter a large number of video files in daily life. Compared with video that contains only mono audio, video with stereo audio provides a better audio-visual experience. However, many ordinary users lack the professional equipment needed to record video with high-quality stereo. To make stereo video easier to obtain, we propose an effective method for converting the mono audio in a video into stereo. A key to this task is how to effectively inject visual information extracted from the video frames into the audio signal. We design a novel multi-attention fusion network (MAFNet), based on the self-attention mechanism, that extracts the spatial features related to the sound source in the video frames and fuses them into the audio features. Furthermore, to obtain higher-quality stereo, we design an additional iterative structure that refines the generated stereo sound over several iterations. Our approach is validated on two challenging video datasets (FAIR-Play and YT-MUSIC) and achieves new state-of-the-art performance.
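The abstract describes the fusion step only at a high level. As a rough illustration of the general idea of attention-based audio-visual fusion, and not of the MAFNet architecture itself, the NumPy sketch below lets audio features attend over visual features with scaled dot-product attention and then repeats the fusion a few times as a loose analogy to iterative refinement. All names, shapes, and the random projection matrices (`cross_attention_fuse`, `refine`, `w_q`, `w_k`, `w_v`) are hypothetical choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(audio, visual, w_q, w_k, w_v):
    """Fuse visual features into audio features (illustrative only).

    audio:  (T, d) array, assumed to hold one feature vector per audio patch
    visual: (N, d) array, assumed to hold one feature vector per image region
    """
    q = audio @ w_q                           # queries come from the audio
    k = visual @ w_k                          # keys come from the video frame
    v = visual @ w_v                          # values come from the video frame
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (T, N) similarity scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    fused = audio + weights @ v               # residual connection
    return fused, weights

def refine(audio, visual, params, steps=3):
    # Assumption: refinement is modelled here as simply re-applying the
    # fusion step; the paper's iterative structure may differ.
    fused = audio
    for _ in range(steps):
        fused, _ = cross_attention_fuse(fused, visual, *params)
    return fused

rng = np.random.default_rng(0)
d = 8
audio = rng.normal(size=(5, d))    # 5 hypothetical audio patches
visual = rng.normal(size=(4, d))   # 4 hypothetical image regions
params = tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(3))

fused, weights = cross_attention_fuse(audio, visual, *params)
refined = refine(audio, visual, params, steps=3)
```

Each row of `weights` indicates how strongly one audio patch draws on each image region, which is the sense in which visual information is "injected" into the audio features in this sketch.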