Visual-guided scene-aware audio generation method based on hierarchical feature codec and rendering decision

Ruiqi Wang, Haonan Cheng, Long Ye, Qin Zhang
DOI: https://doi.org/10.1016/j.displa.2024.102708
IF: 3.074
2024-07-01
Displays
Abstract: Visually guided spatial sound generation (VGSSG) is a multimodal learning approach well suited to recorded videos. However, existing methods are difficult to apply directly to spatial sound generation for movie clips, for two main reasons: (1) movies employ a Cinematic Audiovisual Language (CAL), which makes it difficult to construct spatial sound mapping models with purely data-driven methods; and (2) model performance is inadequate because of the large heterogeneous gap between audio and visual modal information. To address these problems, we propose a VGSSG method based on CAL decision-making and hierarchical feature encoding and decoding, which generates spatial sound in accordance with the CAL of movies. Specifically, to model CAL, we establish a multimodal information-guided movie audio rendering decision maker that selects the rendering strategy according to the CAL of the current clip. To narrow the heterogeneous gap that hinders the fusion of audiovisual modal data, we propose a codec structure based on hierarchical fusion of audiovisual features and full-scale skip connections, which makes more efficient use of audiovisual modal data and demonstrates the effectiveness of shallow features in the VGSSG task. We integrate both 2-channel and 6-channel spatial audio generation into a unified framework. In addition, we establish a movie audiovisual bimodal dataset with hand-crafted CAL annotations. Experiments demonstrate that our method outperforms existing methods in reducing generation distortion.
engineering, electrical & electronic; instruments & instrumentation; optics; computer science, hardware & architecture