Abstract: For audio in augmented reality (AR), knowledge of the users' real acoustic environment is crucial for rendering virtual sounds that seamlessly blend into the environment. As acoustic measurements are usually not feasible in practical AR applications, information about the room needs to be inferred from available sound sources. Then, additional sound sources can be rendered with the same room acoustic qualities. Crucially, these are placed at different positions than the sources available for estimation. Here, we propose to use an encoder network trained using a contrastive loss that maps input sounds to a low-dimensional feature space representing only room-specific information. Then, a diffusion-based spatial room impulse response generator is trained to take the latent space and generate a new response, given a new source-receiver position. We show how both room- and position-specific parameters are considered in the final output.
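
The abstract does not say which contrastive loss is used. As a minimal sketch, assuming an InfoNCE-style objective over cosine similarities, the idea of pulling together embeddings from the same room and pushing apart embeddings from different rooms could look like the following. The function names, the temperature value, and the toy two-dimensional embeddings are all hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor (assumed form, not the paper's).

    `positive` is an embedding from the same room as `anchor` but a
    different source-receiver position; `negatives` come from other rooms.
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all logits.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_sum_exp - pos_logit

# Toy embeddings: two positions in one room, one position in another room.
same_room_a = [1.0, 0.1]
same_room_b = [0.9, 0.2]
other_room = [-0.2, 1.0]

good = info_nce(same_room_a, same_room_b, [other_room])  # correct pairing
bad = info_nce(same_room_a, other_room, [same_room_b])   # swapped pairing
print(good < bad)
```

The loss is small when the same-room pair is more similar than any cross-room pair, which is what would push the encoder to discard position-dependent information.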

What problem does this paper attempt to address?

This paper aims to solve the problem of how to generate the spatial room impulse response (SRIR) of virtual sound sources in augmented reality (AR) applications so that they can be seamlessly integrated into the user's actual acoustic environment. Specifically, the paper proposes a method that can infer the acoustic properties of a room from available sound sources and then use these properties to generate new SRIRs for new sound sources at different locations. The key to this method is the ability to estimate the acoustic properties of a room based only on the audio signals in the existing sound field, without conducting dedicated acoustic measurements, and to generate new SRIRs accordingly.

### Main contributions

1. **Encoding of room-specific information**: The paper proposes to use an encoder network trained with a contrastive loss to extract features that contain only room-specific information from the complete acoustic scene, that is, features that are independent of the specific positions of the sound source and the receiver.
2. **Diffusion model to generate SRIR**: The paper further proposes a generator based on a diffusion model. This generator can generate a new four-channel SRIR from the extracted room-specific embedding and the new source-receiver position vector. It takes into account both the characteristics of the room and the position-dependent features, thereby generating perceptually reasonable SRIRs.

### Problems solved

- **Limitations of mono-channel RIR estimation**: Most existing methods (including those based on deep learning) are only applicable to mono-channel RIR estimation and ignore the directional characteristics in a multi-channel SRIR. The paper solves this problem by generating a four-channel SRIR.
- **Estimation in a multi-sound-source environment**: Existing methods usually require a specific isolated sound source to achieve the most accurate estimation, while in actual AR applications the user may be in an acoustic scene containing multiple active sound sources. The method proposed in the paper can handle such a multi-sound-source environment and generate SRIRs at new positions.

### Method overview

- **Dataset generation**: A synthetic dataset is used for model training. The dataset includes simulated SRIRs with different room sizes, materials, and sound source positions.
- **Encoder training**: The encoder is trained in a contrastive learning framework. The goal is to extract room-specific information while ignoring the specific positions of the sound source and the receiver.
- **Generator training**: The generator is trained as a diffusion model. The input includes the room-specific embedding and the new source-receiver position vector, and the output is a four-channel SRIR.

### Experimental results

- **Room-specific parameters**: The generated SRIR shows good performance in terms of the reverberation time (RT) in the mid-frequency band and the direct-to-reverberant energy ratio (DRR), and correlates highly with the real values.
- **Position-specific parameters**: The generated SRIR also shows good performance in terms of the DRR at different positions and the direction of arrival (DoA) of the direct sound, and can accurately point to the provided sound source position.

In conclusion, the paper proposes an innovative method that can generate perceptually reasonable spatial room impulse responses in augmented reality applications, thereby improving the integration of virtual sound sources into the actual acoustic environment.
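
The two room-acoustic metrics named in the experimental results, DRR and RT, can be computed from an impulse response roughly as sketched below. This is an illustration under stated assumptions, not the paper's evaluation code: it works on a single channel, uses a 2.5 ms window around the strongest peak as the "direct" part, estimates RT60 from a line fit to the Schroeder decay curve between -5 dB and -25 dB (a T20-style fit), and skips the mid-frequency band filtering. The function names and the synthetic test response are hypothetical.

```python
import math

def drr_db(rir, fs, window_ms=2.5):
    """Direct-to-reverberant ratio in dB (window length is an assumption)."""
    peak = max(range(len(rir)), key=lambda n: abs(rir[n]))
    half = int(round(window_ms * 1e-3 * fs))
    lo, hi = max(0, peak - half), min(len(rir), peak + half + 1)
    direct = sum(x * x for x in rir[lo:hi])
    reverb = sum(x * x for x in rir[:lo]) + sum(x * x for x in rir[hi:])
    return 10.0 * math.log10(direct / reverb)

def rt60_from_t20(rir, fs):
    """RT60 in seconds, extrapolated from the -5..-25 dB decay range."""
    # Schroeder backward integration: energy remaining after each sample.
    edc, acc = [], 0.0
    for x in reversed(rir):
        acc += x * x
        edc.append(acc)
    edc.reverse()
    total = edc[0]
    # Collect (time, level) points inside the fitting range.
    points = []
    for n, energy in enumerate(edc):
        if energy <= 0.0:
            break
        level = 10.0 * math.log10(energy / total)
        if -25.0 <= level <= -5.0:
            points.append((n / fs, level))
    # Least-squares slope of level over time, in dB per second.
    t_mean = sum(t for t, _ in points) / len(points)
    y_mean = sum(y for _, y in points) / len(points)
    num = sum((t - t_mean) * (y - y_mean) for t, y in points)
    den = sum((t - t_mean) ** 2 for t, _ in points)
    return -60.0 / (num / den)

# Synthetic response: unit direct sound plus a tail whose energy decays
# by 60 dB over true_rt60 seconds.
fs = 8000
true_rt60 = 0.5
rir = [1.0] + [0.1 * 10.0 ** (-3.0 * n / (fs * true_rt60)) for n in range(1, fs)]

drr = drr_db(rir, fs)
rt60 = rt60_from_t20(rir, fs)
print(f"DRR = {drr:.2f} dB, RT60 = {rt60:.3f} s")
```

Comparing such values between generated and reference responses is one way to check that a generator preserves room-specific parameters (RT) and position-specific ones (DRR).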

Blind Spatial Impulse Response Generation from Separate Room- and Scene-Specific Information
