SPEAR: Receiver-to-Receiver Acoustic Neural Warping Field

Yuhang He, Shitong Xu, Jia-Xing Zhong, Sangyun Shin, Niki Trigoni, Andrew Markham
2024-06-17
Abstract: We present SPEAR, a continuous receiver-to-receiver acoustic neural warping field for spatial acoustic effects prediction in an acoustic 3D space with a single stationary audio source. Unlike traditional source-to-receiver modelling methods, which require prior knowledge of the space's acoustic properties to rigorously model audio propagation from source to receiver, we propose to predict by warping the spatial acoustic effects from one reference receiver position to another target receiver position, so that the warped audio essentially accommodates all spatial acoustic effects belonging to the target position. SPEAR can be trained with much more readily accessible data: we simply ask two robots to independently record spatial audio at different positions. We further theoretically prove the universal existence of the warping field if and only if a single audio source is present. Three physical principles are incorporated to guide the SPEAR network design, making the learned warping field physically meaningful. We demonstrate SPEAR's superiority on synthetic, photo-realistic, and real-world datasets, showing the potential of SPEAR for various downstream robotic tasks.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper addresses the prediction of spatial acoustic effects from one receiver to another in enclosed three-dimensional spaces. Specifically, the authors propose a new framework, SPEAR (Spatial Perceptual Acoustic Neural Warping Field), for predicting spatial acoustic effects such as reverberation, loudness variation, and resonance at any given receiver position.

Traditional methods rely mainly on source-to-receiver modeling. These methods need a large amount of prior knowledge to accurately simulate sound propagation from the source to the receiver, such as the geometric layout of the room, its material properties, and the position of the sound source. In practice this prior knowledge is very difficult to obtain, and the simulation has high computational complexity. In addition, although some neural-network-based methods can learn continuous acoustic fields, they still require a large amount of room impulse response (RIR) data for training, which is also hard to collect in real-world scenarios.

To solve these problems, SPEAR takes a different perspective: **the receiver-to-receiver perspective**. Using only audio recorded independently by two receivers at different positions, SPEAR can learn and predict spatial acoustic effects without relying on RIR data or prior acoustic properties. The training data is therefore easier to obtain, and spatial acoustic effects can be predicted efficiently at any position.

### Main contributions

1. **Proposed a novel receiver-to-receiver spatial acoustic effect prediction framework**:
   - SPEAR does not require traditional RIR data or hard-to-obtain prior spatial acoustic properties. It is trained with more easily obtainable data, namely audio recorded by receivers at different positions.
2. **Theoretically proved the existence of the receiver-to-receiver neural warping field**:
   - The authors prove that the receiver-to-receiver warping field exists universally when there is a single stationary sound source in the 3D space. The network design is guided by three physical principles (globality, order perception, and audio-content independence), which makes the learned warping field physically meaningful.
3. **Demonstrated the superior performance of SPEAR on synthetic, photo-realistic, and real-world datasets**:
   - The experimental results show that SPEAR performs well across these datasets and has potential for a variety of downstream robotic tasks.

### Mathematical formula representation

- **Problem definition**:
  \[ W_{p_r \to p_t} = F_\theta(p_t, p_r); \quad \hat{X}_{p_r \to p_t}(f) = W_{p_r \to p_t} \cdot X_{p_r}(f) \]
  where \(\theta\) denotes the trainable parameters, \(p_r\) and \(p_t\) are the reference position and the target position respectively, and \(X_{p_r}(f)\) and \(\hat{X}_{p_r \to p_t}(f)\) are the discrete Fourier transform representations of the audio recorded at the reference position and of the warped audio at the target position respectively.
- **Optimization objective**:
  \[ F_\theta \leftarrow \arg \min_\theta L(\hat{X}_{p_1 \to p_2}(f), X_{p_2}(f)), \quad \forall p_1, p_2 \in P \]
  where the loss \(L\) compares the warped audio with the audio actually recorded at \(p_2\), over all pairs of receiver positions \(p_1, p_2\) in the position set \(P\).

In this way, SPEAR can predict the spatial acoustic effects at any position, which supports tasks such as immersive 3D audio experiences, robot relocalization, and manipulation.
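
The warping relation and the training loss above can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: the room is modelled as an ideal linear time-invariant system with circular convolution, the impulse responses `h_ref` and `h_tgt` are made-up values, and the learned network \(F_\theta\) is replaced by an oracle warping field computed as the ratio of the two transfer functions. The names `warp_audio` and `spectral_l2_loss` are hypothetical, and the squared spectral error is only one possible choice for \(L\).

```python
import numpy as np


def circular_convolve(signal: np.ndarray, impulse_response: np.ndarray) -> np.ndarray:
    """Circularly convolve a signal with a (shorter) impulse response via the FFT."""
    n = len(signal)
    spectrum = np.fft.rfft(signal) * np.fft.rfft(impulse_response, n=n)
    return np.fft.irfft(spectrum, n=n)


def warp_audio(x_ref: np.ndarray, warp: np.ndarray) -> np.ndarray:
    """Compute X_hat(f) = W * X_ref(f) and return the warped time-domain audio."""
    X_ref = np.fft.rfft(x_ref)
    if warp.shape != X_ref.shape:
        raise ValueError("warp must have one complex value per rFFT bin")
    return np.fft.irfft(warp * X_ref, n=len(x_ref))


def spectral_l2_loss(x_hat: np.ndarray, x_tgt: np.ndarray) -> float:
    """Mean squared error between the spectra of warped and recorded audio."""
    diff = np.fft.rfft(x_hat) - np.fft.rfft(x_tgt)
    return float(np.mean(np.abs(diff) ** 2))


rng = np.random.default_rng(0)
n = 1024

# Assumed toy impulse responses from the single source to each receiver.
# h_ref has a dominant first tap, so its transfer function has no zeros.
h_ref = np.array([1.0, 0.3, 0.2, 0.1])
h_tgt = np.array([0.6, 0.0, 0.25, 0.1, 0.05])

# Oracle warping field standing in for F_theta(p_t, p_r).
warp = np.fft.rfft(h_tgt, n=n) / np.fft.rfft(h_ref, n=n)

# Two different source signals recorded at both receiver positions.
losses = []
for _ in range(2):
    source = rng.standard_normal(n)
    x_ref = circular_convolve(source, h_ref)
    x_tgt = circular_convolve(source, h_tgt)
    x_hat = warp_audio(x_ref, warp)
    losses.append(spectral_l2_loss(x_hat, x_tgt))

print("losses for two different source signals:", losses)
```

In this idealised setting the same warping field maps the reference recording to the target recording for both source signals, which is the audio-content independence the summary mentions. In SPEAR the field is predicted by a trained network from the two positions rather than computed from known impulse responses.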