Diff-SAGe: End-to-End Spatial Audio Generation Using Diffusion Models

Saksham Singh Kushwaha, Jianbo Ma, Mark R. P. Thomas, Yapeng Tian, Avery Bruni
2024-10-15
Abstract: Spatial audio is a crucial component in creating immersive experiences. Traditional simulation-based approaches to generate spatial audio rely on expertise, have limited scalability, and assume independence between semantic and spatial information. To address these issues, we explore end-to-end spatial audio generation. We introduce and formulate a new task of generating first-order Ambisonics (FOA) given a sound category and sound source spatial location. We propose Diff-SAGe, an end-to-end, flow-based diffusion-transformer model for this task. Diff-SAGe utilizes a complex spectrogram representation for FOA, preserving the phase information crucial for accurate spatial cues. Additionally, a multi-conditional encoder integrates the input conditions into a unified representation, guiding the generation of FOA waveforms from noise. Through extensive evaluations on two datasets, we demonstrate that our method consistently outperforms traditional simulation-based baselines across both objective and subjective metrics.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses several problems in spatial audio generation:

1. **Limitations of traditional methods**:
   - **Dependence on expert knowledge**: Traditional simulation-based spatial audio generation requires experts to make manual adjustments, which is time-consuming and hard to apply at scale.
   - **Assumption of independence**: These methods usually assume that acoustic content and spatial information are independent, which does not always hold. For example, bird songs are typically highly directional and tend to come from above.
   - **Lack of scalability**: Traditional methods are difficult to scale and extend poorly to multimodal settings such as visual-to-spatial-audio generation.
2. **End-to-end spatial audio generation**:
   - **Direct generation**: The paper proposes an end-to-end method that generates first-order Ambisonics (FOA) audio directly from a given sound category and sound source location, without iterative, interactive adjustment.
   - **Preserving phase information**: Unlike typical mono audio generation models, the method represents FOA with a complex spectrogram, retaining the phase information that accurate spatial cues depend on.
3. **New task definition**:
   - **Generating FOA**: The paper defines a new task: generating FOA audio given a sound category and the spatial location of the sound source.
   - **Model design**: Diff-SAGe, a flow-based diffusion-transformer model, generates spatial audio from noise. A multi-conditional encoder integrates the input conditions into a unified representation that guides the generation of FOA waveforms.
4. **Evaluation and comparison**:
   - **Performance verification**: In experiments on two datasets, the paper reports that Diff-SAGe outperforms traditional simulation-based baselines on both objective and subjective metrics.

In summary, the main goal of this paper is to develop an efficient, accurate, and scalable end-to-end spatial audio generation method that overcomes the limitations of traditional approaches.
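To make the point about phase concrete, the following is a minimal sketch, not the paper's pipeline. It pans a mono tone to FOA with the standard anechoic plane-wave gains and then compares what a complex spectrum bin and a magnitude-only bin reveal about direction. The channel convention (ACN order W, Y, Z, X with SN3D gains, azimuth positive to the left) is an assumption, since the summary does not state which one the authors use, and the function names `encode_foa`, `dft_bin`, `azimuth_from_complex`, and `azimuth_from_magnitude` are hypothetical helpers introduced only for illustration.

```python
import cmath
import math


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Pan a mono signal to FOA as an anechoic plane wave.

    Assumed convention: ACN channel order (W, Y, Z, X), SN3D gains,
    azimuth measured counter-clockwise (positive = left).
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = {
        "W": 1.0,
        "Y": math.sin(az) * math.cos(el),
        "Z": math.sin(el),
        "X": math.cos(az) * math.cos(el),
    }
    return {ch: [g * s for s in mono] for ch, g in gains.items()}


def dft_bin(frame, k):
    """Complex DFT coefficient of one frame at bin k (naive O(n) sum)."""
    n = len(frame)
    return sum(x * cmath.exp(-2j * math.pi * k * i / n)
               for i, x in enumerate(frame))


def azimuth_from_complex(foa, k):
    """Azimuth estimate from complex bins via the active intensity vector.

    Re(conj(W) * X) and Re(conj(W) * Y) keep the relative phase (here,
    the sign) of X and Y with respect to W.
    """
    w = dft_bin(foa["W"], k)
    ix = (w.conjugate() * dft_bin(foa["X"], k)).real
    iy = (w.conjugate() * dft_bin(foa["Y"], k)).real
    return math.degrees(math.atan2(iy, ix))


def azimuth_from_magnitude(foa, k):
    """Same estimate from magnitudes only; the sign of each axis is lost."""
    mx = abs(dft_bin(foa["X"], k))
    my = abs(dft_bin(foa["Y"], k))
    return math.degrees(math.atan2(my, mx))


if __name__ == "__main__":
    n, k = 64, 4  # frame length and the bin where the test tone lives
    tone = [math.sin(2 * math.pi * k * i / n) for i in range(n)]
    for az in (60.0, -60.0):
        foa = encode_foa(tone, az, 0.0)
        print(az, azimuth_from_complex(foa, k), azimuth_from_magnitude(foa, k))
```

In this toy setting, sources at +60° and -60° azimuth produce identical magnitude spectra in every channel, so the magnitude-only estimate cannot tell left from right, while the complex bins recover the signed azimuth. A learned model does not estimate direction this way, but the sketch illustrates why a representation that discards phase also discards spatial cues.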