Hearing Anything Anywhere

Mason Wang, Ryosuke Sawata, Samuel Clarke, Ruohan Gao, Shangzhe Wu, Jiajun Wu
2024-06-12
Abstract: Recent years have seen immense progress in 3D computer vision and computer graphics, with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However, alongside immersive visual experiences, immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper, we aim to reconstruct the spatial acoustic characteristics of an arbitrary environment given only a sparse set of (roughly 12) room impulse response (RIR) recordings and a planar reconstruction of the scene, a setup that is easily achievable by ordinary users. To this end, we introduce DiffRIR, a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene, including sound source directivity and surface reflectivity. This allows us to synthesize novel auditory experiences through the space with any source audio. To evaluate our method, we collect a dataset of RIR recordings and music in four diverse, real environments. We show that our model outperforms state-of-the-art baselines on rendering monaural and binaural RIRs and music at unseen locations, and learns physically interpretable parameters characterizing acoustic properties of the sound source and surfaces in the scene.
Sound, Computer Vision and Pattern Recognition, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper aims to reconstruct the spatial acoustic characteristics of an arbitrary environment from sparse room impulse response (RIR) measurements and a planar reconstruction of the scene. Specifically, the authors use a small number (about 12) of RIR recordings, together with the planar reconstruction, to simulate any source audio as heard from any location, hence "Hearing Anything Anywhere". The goal resembles sparse-view novel view synthesis (NVS) in computer vision and graphics, but sound differs from light in ways that matter here: it varies over time and propagates slowly, which makes common visual NVS methods unsuitable for audio.

To this end, the paper introduces DiffRIR, a differentiable RIR rendering framework with interpretable parametric models of the scene's salient acoustic features, such as sound source directivity and surface reflectivity. With these models, DiffRIR can synthesize novel auditory experiences at any location in the space. The parameters are fitted with an analysis-by-synthesis approach: the difference between DiffRIR's rendered output and the measured RIRs is minimized, which yields physically interpretable parameters for the sound source and the surfaces in the scene.

The main contributions of the paper are:
1. DiffRIR, a differentiable acoustic inverse-rendering framework that recovers the immersive sound field of a room from a sparse set of RIR measurements.
2. A new dataset of real RIRs measured at hundreds of locations in four real-world environments.
3. Comparisons with existing methods across various settings, showing that in practical data-limited scenarios DiffRIR is more effective and predicts more accurate RIRs and music at unseen locations.
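The fitting idea described above (adjusting model parameters until rendered RIRs match measured ones) can be illustrated with a deliberately small sketch. It is not DiffRIR's renderer: the two-path model, the distances, the low sample rate, and all function names are assumptions made for this example, and the gradient is written by hand where a real differentiable renderer would typically rely on automatic differentiation.

```python
# Toy analysis-by-synthesis: recover one wall reflectivity from a "measured" RIR.
# Everything here (two-path model, constants, names) is a hypothetical illustration.

SPEED_OF_SOUND = 343.0  # m/s, approximate speed of sound in air
SAMPLE_RATE = 1000      # Hz, kept very low so the toy RIR stays short
DIRECT_DIST = 1.7       # m, assumed source-to-listener path length
ECHO_DIST = 6.9         # m, assumed path length via one wall reflection


def delay_samples(distance_m):
    """Propagation delay of a path, rounded to whole samples."""
    return round(distance_m / SPEED_OF_SOUND * SAMPLE_RATE)


def render_rir(reflectivity, n_samples=64):
    """Hypothetical renderer: direct sound plus one first-order reflection."""
    rir = [0.0] * n_samples
    rir[delay_samples(DIRECT_DIST)] = 1.0 / DIRECT_DIST        # 1/r spreading loss
    rir[delay_samples(ECHO_DIST)] = reflectivity / ECHO_DIST   # scaled by the wall
    return rir


def loss_and_grad(reflectivity, measured):
    """Squared error to the measurement and its derivative w.r.t. reflectivity."""
    predicted = render_rir(reflectivity, len(measured))
    loss = sum((p - m) ** 2 for p, m in zip(predicted, measured))
    k = delay_samples(ECHO_DIST)  # only this tap depends on the reflectivity
    grad = 2.0 * (predicted[k] - measured[k]) / ECHO_DIST
    return loss, grad


def fit_reflectivity(measured, steps=200, lr=10.0, init=0.5):
    """Plain gradient descent on the single unknown parameter."""
    r = init
    for _ in range(steps):
        _, grad = loss_and_grad(r, measured)
        r -= lr * grad
    return r


measured_rir = render_rir(0.8)  # stand-in for a real recording
estimated = fit_reflectivity(measured_rir)
print(f"estimated reflectivity: {estimated:.4f}")
```

Because the fitted quantity is a physical parameter of the scene and not an opaque network weight, it can be read off and inspected directly, which is the sense in which such parameters are interpretable.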
In summary, this research aims to capture a real-world acoustic space with a minimal measurement setup (roughly 12 RIR recordings), which makes it practical for many consumer applications.
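Once an RIR is available for a listener position, rendering arbitrary source audio at that position is commonly done by convolving the dry signal with the RIR, treating the room as a linear time-invariant system. The sketch below shows that step in plain Python; the sample values are made up for illustration, and a practical implementation would use an FFT-based convolution from a numerical library.

```python
def convolve(signal, rir):
    """Direct-form linear convolution of a dry signal with an impulse response."""
    out = [0.0] * (len(signal) + len(rir) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


# Made-up values: a three-sample "dry" source and a four-tap RIR containing
# a direct arrival (index 1) and a weaker, later reflection (index 3).
dry_audio = [1.0, 0.5, -0.25]
rir = [0.0, 0.6, 0.0, 0.2]

wet = convolve(dry_audio, rir)  # what a listener at that position would hear
print(wet)
```

The output is longer than the input by the RIR length minus one sample, which corresponds to the reverberant tail that continues after the source stops.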