Abstract: Sound propagation is the process by which sound energy travels through a medium, such as air, to the surrounding environment as sound waves. The room impulse response (RIR) describes this process and is influenced by the positions of the source and listener, the room's geometry, and its materials. Physics-based acoustic simulators have been used for decades to compute accurate RIRs for specific acoustic environments. However, existing acoustic simulators have limitations, and we propose three novel solutions to address them. First, we introduce a learning-based RIR generator that is two orders of magnitude faster than an interactive ray-tracing simulator. Our approach can be trained to take both statistical and traditional parameters directly as input, and it can generate both monaural and binaural RIRs for both reconstructed and synthetic 3D scenes. Our generated RIRs outperform those from interactive ray-tracing simulators in speech-processing applications, including automatic speech recognition (ASR), speech enhancement, and speech separation. Second, we propose estimating RIRs from reverberant speech signals and visual cues without a 3D representation of the environment. By estimating RIRs from reverberant speech, we can augment training data to match test data, improving the word error rate of the ASR system. Our estimated RIRs achieve a 6.9% improvement over previous learning-based RIR estimators in far-field ASR tasks. We demonstrate, through perceptual evaluation, that our audio-visual RIR estimator aids tasks such as visual acoustic matching, novel-view acoustic synthesis, and voice dubbing. Finally, we introduce IR-GAN to augment accurate RIRs using real RIRs. IR-GAN parametrically controls acoustic parameters learned from real RIRs to generate new RIRs that imitate different acoustic environments, outperforming ray-tracing simulators on the far-field ASR benchmark by 8.95%.
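As a rough illustration of how an RIR is used to augment training data, the sketch below convolves a dry signal with an RIR to produce reverberant speech. It is a minimal sketch under stated assumptions, not the method of this work: the function names `synthetic_rir` and `reverberate` are hypothetical, and the toy RIR (exponentially decaying noise with a chosen reverberation time) merely stands in for an RIR that would come from a simulator, a learned generator, or an estimator.

```python
import numpy as np


def synthetic_rir(fs=16000, rt60=0.5, length_s=0.6, seed=0):
    """Toy RIR: exponentially decaying noise (an assumption for illustration only)."""
    rng = np.random.default_rng(seed)
    n = int(round(fs * length_s))
    t = np.arange(n) / fs
    # Amplitude envelope that has dropped by 60 dB at t = rt60.
    decay = 10.0 ** (-3.0 * t / rt60)
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0  # direct-path component
    return rir / np.max(np.abs(rir))


def reverberate(speech, rir):
    """Convolve dry speech with an RIR, trim to the input length, peak-normalize."""
    wet = np.convolve(speech, rir)[: len(speech)]
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet


if __name__ == "__main__":
    fs = 16000
    rng = np.random.default_rng(1)
    dry = rng.standard_normal(fs)  # one second of noise standing in for clean speech
    wet = reverberate(dry, synthetic_rir(fs=fs))
    print(wet.shape)
```

In a data-augmentation setting, each clean training utterance would be passed through `reverberate` with an RIR chosen to resemble the target (test) acoustic conditions.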

Realization of Global Audio Telepresence Via a Learning-Based Model-Matching Approach with an Acoustic Array System

Learning-based Array Configuration-Independent Binaural Audio Telepresence with Scalable Signal Enhancement and Ambience Preservation

Microphone array processing via joint wideband angle-of-arrival estimation and speech feature enhancement

Model-matching Principle Applied to the Design of an Array-based All-neural Binaural Rendering System for Audio Telepresence

Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array

Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

Brain-controlled augmented hearing for spatially moving conversations in multi-talker environments

Autonomous In-Situ Soundscape Augmentation via Joint Selection of Masker and Gain

Enabling Real-Time On-Chip Audio Super Resolution for Bone-Conduction Microphones

Efficient learning-based sound propagation for virtual and real-world audio processing applications

Learning-based personal speech enhancement for teleconferencing by exploiting spatial-spectral features

Learning an Interpretable End-to-End Network for Real-Time Acoustic Beamforming

Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

Space-and-speaker-aware Acoustic Modeling with Effective Data Augmentation for Recognition of Multi-Array Conversational Speech

Multichannel Learning-Based Spatially Extended Active Noise Control Via Model Matching and Sensor Transfer Function Interpolation

Audio-visual multi-channel speech separation, dereverberation and recognition

CochleaNet: A Robust Language-independent Audio-Visual Model for Speech Enhancement

Array2BR: An End-to-End Noise-immune Binaural Audio Synthesis from Microphone-array Signals

Deep Learning Based Stage-wise Two-dimensional Speaker Localization with Large Ad-hoc Microphone Arrays

An Environment Adaptive Loudspeaker Calibration Method for Ambisonics Decoding System

Joint model-based recognition and localization of overlapped acoustic events using a set of distributed small microphone arrays