Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo
2024-10-15
Abstract: Recently, diffusion models have achieved great success in mono-channel audio generation. However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources. Beyond text modality, we have also acquired a set of images and rationally paired stereo audios through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random and indistinct spatial audio. To provide accurate guidance for latent diffusion models, we introduce the SpatialSonic model utilizing spatial-aware encoders and azimuth state matrices to reveal reasonable spatial guidance. By leveraging spatial guidance, our unified model not only achieves the objective of generating immersive and controllable spatial audio from text and image but also enables interactive audio generation during inference. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules.
Sound, Computer Vision and Pattern Recognition, Audio and Speech Processing
What problem does this paper attempt to address?
The main problem this paper addresses is generating stereo audio with spatial awareness. Current audio generation models have achieved remarkable success on mono audio, but stereo soundscapes often contain multiple objects and directions, so controlling the spatial context of stereo audio remains challenging. The difficulty is attributed mainly to high data costs and unstable generative models. To address these challenges, the authors propose the following solutions:

1. **Construct a large-scale dataset**: The authors constructed a large-scale simulated dataset named BEWO-1M, which contains rich soundscapes and descriptions, including moving and multi-source scenes. They also acquired a set of images and suitably paired stereo audio through retrieval to promote multimodal generation.
2. **Introduce a spatial-aware model**: The authors proposed the SpatialSonic model, which uses spatial-aware encoders and azimuth state matrices to provide reasonable spatial guidance. With this guidance, their unified model can generate immersive and controllable stereo audio from text and images, and it also supports interactive audio generation during inference.
3. **Evaluation methods**: The authors conducted subjective and objective evaluations on simulated and real-world data to compare their method with existing methods. The results show that their method performs well in generating stereo audio that conforms to physical rules.

### Specific problems and solutions

1. **Data scale problem**:
   - **Solution**: Constructed the BEWO-1M dataset, which contains 1 million audio samples generated through rigorous simulation and GPT-assisted caption transformation. The dataset covers a variety of soundscapes, including moving-source, multi-source, and interleaved-source scenes, and has been manually inspected to ensure perceptual consistency.
2. **Precise guidance construction problem**:
   - **Solution**: Introduced an azimuth fusion module, which uses an LLM and specific schemes to generate explicit spatial guidance. Specifically, an azimuth state matrix \( S\in\mathbb{R}^{K\times L_{\text{azi}}\times d_{\text{time}}} \) is used to encode azimuth information at different time slots. Coarse guidance is generated through a Gaussian distribution, and fine guidance is generated through a discrete state matrix.
3. **Evaluation metric problem**:
   - **Solution**: Proposed a series of subjective and objective evaluation metrics based on ITD (interaural time difference) and opinion scores. Experimental results show that the SpatialSonic model generates realistic stereo audio, with a 70% reduction in ITD error and higher opinion scores than other popular models.

### Summary

The main contributions of this paper are:

1. Constructed a large-scale stereo audio dataset, BEWO-1M, which supports large-scale training and precise evaluation.
2. Proposed a one-stage controllable stereo audio generation framework, SpatialSonic, which can generate two-channel audio that precisely follows multimodal spatial context.
3. Introduced a set of subjective and objective evaluation metrics based on ITD and opinion scores to systematically evaluate the quality of the generated audio.

These contributions provide new ideas and technical means for generating stereo audio with spatial awareness and are expected to be widely used in virtual reality, augmented reality, and embodied AI.
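
To make the ITD-based evaluation concrete, here is a minimal sketch of how an interaural time difference can be estimated from a two-channel clip and compared against a reference. It uses plain time-domain cross-correlation with NumPy, which is an assumption made for illustration and not necessarily the estimator used in the paper. The function names `estimate_itd` and `itd_error`, the 1 ms lag bound, and the sign convention are all hypothetical.

```python
import numpy as np


def estimate_itd(left, right, sample_rate, max_itd_s=1e-3):
    """Estimate the interaural time difference (seconds) of a stereo clip.

    Positive values mean the left channel lags the right one, i.e. the
    sound reaches the right ear first (assumed sign convention).
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    # Full cross-correlation; entry i corresponds to a lag of lags[i] samples.
    corr = np.correlate(left, right, mode="full")
    lags = np.arange(-(len(right) - 1), len(left))
    # Keep only physically plausible lags (assumed bound: about 1 ms).
    max_lag = int(round(max_itd_s * sample_rate))
    plausible = np.abs(lags) <= max_lag
    best_lag = lags[plausible][np.argmax(corr[plausible])]
    return best_lag / sample_rate


def itd_error(generated, reference, sample_rate):
    """Absolute ITD difference between two (left, right) channel pairs."""
    return abs(estimate_itd(*generated, sample_rate)
               - estimate_itd(*reference, sample_rate))


# Toy usage: white noise whose left channel is delayed by 5 samples.
rng = np.random.default_rng(0)
source = rng.standard_normal(2000)
delayed = np.concatenate([np.zeros(5), source[:-5]])
itd = estimate_itd(delayed, source, sample_rate=16000)
```

In the toy example the left channel is a delayed copy of the right one, so the correlation peak sits at a positive lag. Real recordings are reverberant and band-limited, so a more robust estimator would normally be preferred in practice.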
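
The coarse and fine azimuth guidance described above can be sketched as follows, purely as an illustration of the shape \( K\times L_{\text{azi}}\times d_{\text{time}} \). The summary does not give the bin layout, the azimuth range, or the Gaussian width, so the choices here (evenly spaced bin centers over 0 to 180 degrees, a width given in degrees, per-slot normalization) and the function name `azimuth_state_matrix` are assumptions rather than the authors' scheme.

```python
import numpy as np


def azimuth_state_matrix(azimuths_deg, num_bins=5, sigma_deg=None):
    """Build a (K, num_bins, T) azimuth state matrix.

    azimuths_deg: array of shape (K, T) holding the azimuth of each of
        the K sources at each of the T time slots, assumed in [0, 180].
    sigma_deg: if None, return a discrete one-hot state per time slot
        (fine guidance); otherwise return Gaussian weights over the
        azimuth bins (coarse guidance), normalized per time slot.
    """
    az = np.asarray(azimuths_deg, dtype=float)
    centers = np.linspace(0.0, 180.0, num_bins)  # assumed bin centers
    # Offset of every bin center from every azimuth: shape (K, num_bins, T).
    diff = centers[None, :, None] - az[:, None, :]
    if sigma_deg is None:
        nearest = np.argmin(np.abs(diff), axis=1)  # shape (K, T)
        state = np.zeros(diff.shape)
        k_idx, t_idx = np.indices(nearest.shape)
        state[k_idx, nearest, t_idx] = 1.0
        return state
    weights = np.exp(-0.5 * (diff / sigma_deg) ** 2)
    return weights / weights.sum(axis=1, keepdims=True)


# One source sweeping across the assumed azimuth range over three time slots.
trajectory = [[0.0, 90.0, 180.0]]
fine = azimuth_state_matrix(trajectory)                    # one-hot states
coarse = azimuth_state_matrix(trajectory, sigma_deg=30.0)  # Gaussian weights
```

A moving source shows up as a peak that shifts across the azimuth axis from one time slot to the next, while a static source keeps the same column throughout.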