Blind Spatial Impulse Response Generation from Separate Room- and Scene-Specific Information

Francesc Lluís, Nils Meyer-Kahlen
2024-09-23
Abstract: For audio in augmented reality (AR), knowledge of the users' real acoustic environment is crucial for rendering virtual sounds that seamlessly blend into the environment. As acoustic measurements are usually not feasible in practical AR applications, information about the room needs to be inferred from available sound sources. Then, additional sound sources can be rendered with the same room acoustic qualities. Crucially, these are placed at different positions than the sources available for estimation. Here, we propose to use an encoder network trained using a contrastive loss that maps input sounds to a low-dimensional feature space representing only room-specific information. Then, a diffusion-based spatial room impulse response generator is trained to take the latent space and generate a new response, given a new source-receiver position. We show how both room- and position-specific parameters are considered in the final output.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper aims to solve the problem of how to generate the spatial room impulse response (SRIR) of virtual sound sources in augmented reality (AR) applications so that they can be seamlessly integrated into the user's actual acoustic environment. Specifically, the paper proposes a method that can infer the acoustic properties of a room from available sound sources and then use these properties to generate new SRIRs to adapt to new sound sources at different locations. The key to this method is the ability to estimate the acoustic properties of a room based only on the audio signals in the existing sound field without conducting special acoustic measurements, and generate new SRIRs accordingly.

### Main contributions

1. **Encoding of room-specific information**: The paper proposes to use an encoder network trained with contrastive loss to extract features that only contain room-specific information from the complete acoustic scene, that is, features that are independent of the specific positions of the sound source and the receiver.
2. **Diffusion model to generate SRIR**: The paper further proposes a generator based on a diffusion model. This generator can generate a new four-channel SRIR according to the extracted room-specific embedding and the new sound source-receiver position vector. This generator takes into account the characteristics of the room and the position-dependent features, thereby generating perceptually reasonable SRIRs.

### Problems solved

- **Limitations of mono-channel RIR estimation**: Most of the existing methods (including those based on deep learning) are only applicable to mono-channel RIR estimation and ignore the directional characteristics in multi-channel SRIR. The paper solves this problem by generating a four-channel SRIR.
- **Estimation in a multi-sound-source environment**: Existing methods usually require a specific isolated sound source to achieve the most accurate estimation, while in actual AR applications, the user may be in an acoustic scene containing multiple active sound sources. The method proposed in the paper can handle such a multi-sound-source environment and generate SRIRs at new positions.

### Method overview

- **Dataset generation**: Use a synthetic dataset for model training. The dataset includes simulated SRIRs with different room sizes, materials, and sound source positions.
- **Encoder training**: The encoder is trained through a contrastive learning framework. The goal is to extract room-specific information while ignoring the specific positions of the sound source and the receiver.
- **Generator training**: The generator is trained based on a diffusion model. The input includes the room-specific embedding and the new sound source-receiver position vector, and the output is a four-channel SRIR.

### Experimental results

- **Room-specific parameters**: The generated SRIR shows good performance in terms of the reverberation time (RT) in the mid-frequency band and the direct-to-reverberant energy ratio (DRR), and has a high correlation with the real values.
- **Position-specific parameters**: The generated SRIR also shows good performance in terms of the DRR at different positions and the direction-of-arrival (DoA) of the direct sound, and can accurately point to the provided sound source position.

In conclusion, the paper proposes an innovative method that can generate perceptually reasonable spatial room impulse responses in augmented reality applications, thereby improving the integration of virtual sound sources in the actual acoustic environment.
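The contrastive encoder training summarized above can be illustrated with a small sketch. The summary does not state which contrastive loss the paper uses, so the InfoNCE-style formulation below, the function name `info_nce_loss`, and the `temperature` value are all assumptions chosen for illustration. The idea it shows is that embeddings of sounds from the same room (at different source-receiver positions) are pulled together, while embeddings from other rooms in the batch are pushed apart.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumed choice, not confirmed by the source).

    anchors[i] and positives[i] are embeddings of sounds from the same room at
    different source-receiver positions. Every other row in the batch is
    treated as coming from a different room.
    """
    # L2-normalise so the dot product equals cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                      # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The matching room sits on the diagonal of the similarity matrix.
    return float(-np.mean(np.diag(log_softmax)))
```

A loss of this shape is low when each anchor is most similar to its own room's positive, which is what makes the resulting embedding insensitive to position while remaining sensitive to the room.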
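The room-specific parameters named in the experimental results, reverberation time (RT) and direct-to-reverberant ratio (DRR), can be estimated from an impulse response along the following lines. This is a minimal sketch under stated assumptions, not the paper's evaluation code: it works on a single channel, it is broadband rather than restricted to the mid-frequency band, and the fit range (-5 to -25 dB) and the 2.5 ms direct-sound window are common conventions picked here for illustration.

```python
import numpy as np

def reverberation_time(rir, fs, fit_db=(-5.0, -25.0)):
    """Estimate RT60 from Schroeder backward integration and a line fit."""
    energy = np.asarray(rir, dtype=float) ** 2
    edc = np.cumsum(energy[::-1])[::-1]              # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-12)   # floor avoids log(0)
    upper, lower = fit_db
    idx = np.where((edc_db <= upper) & (edc_db >= lower))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)  # decay in dB per second
    return -60.0 / slope                             # extrapolate to 60 dB

def direct_to_reverberant_ratio(rir, fs, window_ms=2.5):
    """DRR in dB: energy around the direct-sound peak vs. all later energy."""
    rir = np.asarray(rir, dtype=float)
    peak = int(np.argmax(np.abs(rir)))
    half = int(window_ms * 1e-3 * fs)
    start, end = max(peak - half, 0), peak + half + 1
    direct = np.sum(rir[start:end] ** 2)
    reverb = np.sum(rir[end:] ** 2)
    return 10.0 * np.log10(direct / reverb)
```

Comparing these values between a generated response and its reference, over many rooms and positions, is one way to obtain the kind of correlation the results describe.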