Abstract: We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects. ImmerseDiffusion is trained to generate first-order ambisonics (FOA) audio, which is a conventional spatial audio format comprising four channels that can be rendered to multichannel spatial output. The proposed generative system is composed of a spatial audio codec that maps FOA audio to latent components, a latent diffusion model trained on various user input types (namely text prompts and spatial, temporal, and environmental acoustic parameters), and optionally a spatial audio and text encoder trained in a Contrastive Language and Audio Pretraining (CLAP) style. We propose metrics to evaluate the quality and spatial adherence of the generated spatial audio. Finally, we assess the model performance in terms of generation quality and spatial conformance, comparing the two proposed modes: "descriptive", which uses spatial text prompts, and "parametric", which uses non-spatial text prompts and spatial parameters. Our evaluations demonstrate promising results that are consistent with the user conditions and reflect reliable spatial fidelity.
What problem does this paper attempt to address?
### Problems the paper attempts to solve
The paper addresses the generation of three-dimensional (3D) immersive soundscapes, in response to the growing demand for immersive audio experiences in fields such as virtual reality (VR), augmented reality (AR), healthcare, education, and entertainment. Although existing generative audio models have made significant progress in generating mono or stereo sounds, they lack the ability to accurately position sound sources at the desired spatial locations. For example, these models cannot generate a sound that comes only from the left channel based on the prompt "There is a dog barking in the left channel".
To solve this problem, the paper introduces **ImmerseDiffusion**, an end-to-end generative audio model that can generate 3D immersive audio according to the spatial, temporal, and environmental conditions of sound objects. Specifically, ImmerseDiffusion generates First-Order Ambisonics (FOA) audio, a conventional spatial audio format with four channels that can be rendered into a multichannel spatial output.
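To make the four-channel FOA format concrete, the following is a minimal sketch of panning a mono signal into FOA. It is not taken from the paper: the function name `encode_foa`, the ACN channel order (W, Y, Z, X), the SN3D gains, and the counter-clockwise azimuth convention are all assumptions chosen for illustration.

```python
import numpy as np

def encode_foa(mono, azimuth_deg, elevation_deg):
    """Pan a mono signal to four FOA channels.

    Assumed convention (not stated in the source): ACN order W, Y, Z, X
    with SN3D gains; azimuth is counter-clockwise, so +90 degrees is left.
    """
    az = np.deg2rad(azimuth_deg)
    el = np.deg2rad(elevation_deg)
    w = mono                              # omnidirectional component
    y = mono * np.sin(az) * np.cos(el)    # left-right component
    z = mono * np.sin(el)                 # up-down component
    x = mono * np.cos(az) * np.cos(el)    # front-back component
    return np.stack([w, y, z, x])

# One second of a 440 Hz tone placed hard left on the horizontal plane.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
foa = encode_foa(tone, azimuth_deg=90.0, elevation_deg=0.0)
```

Under these assumptions, a source at the left carries its full signal in W and Y, while X and Z are (numerically) silent. This is the kind of spatial cue that mono or stereo generators cannot express.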
### Main contributions
1. **Generate high-quality 3D immersive audio**:
- ImmerseDiffusion can generate 3D audio with high spatial fidelity, supporting positioning in the horizontal, vertical, and distance dimensions, and taking into account temporal and environmental factors (such as room size and reverberation).
2. **Two generation modes**:
- **Descriptive Mode**: Generate spatial audio based on text descriptions, suitable for narrative-driven applications such as movie audio.
- **Parametric Mode**: Generate spatial audio by combining text descriptions and numerical spatial parameters, suitable for machine-centered applications such as game engines and virtual simulations.
3. **New evaluation metrics**:
- To evaluate the quality and spatial consistency of the generated FOA audio, the paper proposes new evaluation metrics, including Ambisonics Fréchet Audio Distance (FAD), Spatial Kullback-Leibler (KL) divergence, and Spatial CLAP score. In addition, L1 scores for azimuth, elevation, and distance, derived from the sound intensity vector, are used to evaluate spatial accuracy.
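The intensity-vector idea behind the azimuth and elevation scores can be sketched as follows. This is an illustrative approximation, not the paper's evaluation code: the helper names `doa_from_foa` and `angular_l1`, the W, Y, Z, X channel order, and the time-averaging over the whole clip are assumptions. The distance score is omitted because the summary does not say how distance is derived.

```python
import numpy as np

def doa_from_foa(foa):
    """Estimate (azimuth, elevation) in degrees from FOA channels W, Y, Z, X."""
    w, y, z, x = foa
    # Time-averaged intensity direction, up to a positive scale factor.
    ix, iy, iz = np.mean(w * x), np.mean(w * y), np.mean(w * z)
    azimuth = np.degrees(np.arctan2(iy, ix))
    elevation = np.degrees(np.arctan2(iz, np.hypot(ix, iy)))
    return azimuth, elevation

def angular_l1(pred_deg, target_deg):
    """Absolute angle difference in degrees, wrapped to the range [0, 180]."""
    diff = (pred_deg - target_deg + 180.0) % 360.0 - 180.0
    return abs(diff)

# Synthetic check: a noise source panned to azimuth 60, elevation 20 degrees.
rng = np.random.default_rng(0)
s = rng.standard_normal(8000)
az, el = np.deg2rad(60.0), np.deg2rad(20.0)
foa = np.stack([
    s,                              # W
    s * np.sin(az) * np.cos(el),    # Y
    s * np.sin(el),                 # Z
    s * np.cos(az) * np.cos(el),    # X
])
est_az, est_el = doa_from_foa(foa)
az_error = angular_l1(est_az, 60.0)
el_error = angular_l1(est_el, 20.0)
```

The wrap in `angular_l1` matters for azimuth: without it, a prediction of 350 degrees against a target of 10 degrees would be scored as a 340-degree error instead of 20.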
### Technical methods
1. **Spatial audio codec**:
- Use a 1D convolutional U-Net autoencoder to compress the 4-channel FOA signal into a 64-channel latent representation with a compression ratio of 128.
2. **Conditional generation**:
- The descriptive mode uses the text encoder of the ELSA model to encode text prompts that describe the sound source and its spatial and environmental details as conditions.
- The parametric mode uses the text encoder of the LAION CLAP model to provide non-spatial text embeddings and combines them with numerical spatial and environmental parameters.
3. **Diffusion model**:
- Use a Transformer-based diffusion model (Diffusion Transformer) to generate spatial audio through self-attention, cross-attention, and gated MLP components.
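The codec figures above (4 input channels, 64 latent channels, compression ratio 128) leave open whether 128 refers to the overall reduction in values or to the temporal downsampling alone. The short calculation below works out both readings. Both are assumptions, and the summary does not say which one the paper intends; the function name `overall_compression` is hypothetical.

```python
def overall_compression(in_channels, latent_channels, time_downsampling):
    """Ratio of input values to latent values for a temporal downsampling factor."""
    return in_channels * time_downsampling / latent_channels

IN_CHANNELS = 4        # FOA channels (from the text)
LATENT_CHANNELS = 64   # latent channels (from the text)
STATED_RATIO = 128     # compression ratio (from the text)

# Reading A (assumption): 128 is the overall ratio, so solve for the
# temporal downsampling factor that would produce it.
time_factor_a = STATED_RATIO * LATENT_CHANNELS // IN_CHANNELS

# Reading B (assumption): 128 is the temporal downsampling factor, so
# compute the overall ratio of input values to latent values.
overall_b = overall_compression(IN_CHANNELS, LATENT_CHANNELS, STATED_RATIO)

print(f"Reading A: temporal downsampling = {time_factor_a}")
print(f"Reading B: overall compression = {overall_b}")
```

Reading A implies each latent frame covers 2048 audio samples, while Reading B implies only an 8x reduction in the total number of values. The two differ enough that the paper itself should be consulted before relying on either.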
### Experimental results
- The **Descriptive Mode** significantly outperforms the **Parametric Mode** in terms of FAD and ELSA CLAP scores, possibly because the **Descriptive Mode** is conditioned on ELSA embeddings.
- The **Parametric Mode** performs comparably to the **Descriptive Mode** in terms of KL divergence and CLAP scores, and even slightly outperforms it in some metrics, especially showing higher accuracy in the Direction of Arrival (DoA) metric.
### Conclusion
ImmerseDiffusion is a generative spatial audio model that can generate high-quality 3D immersive audio under user-defined conditions. The experimental results show that the model generates spatial audio with high accuracy and reliability and is suitable for a variety of application scenarios.