Abstract: Modern neural-network-based speech processing systems are typically required to be robust against reverberation, so training them requires a large amount of reverberant data. An on-the-fly simulation pipeline is now generally preferred during training, as it lets the model train on an effectively unlimited number of data samples without pre-generating them and saving them to disk. A room impulse response (RIR) simulation method therefore needs not only to generate more realistic artificial RIR filters, but also to generate them quickly enough to accelerate the training process. Existing RIR simulation tools have proven effective across a wide range of speech processing tasks and neural network architectures, but their use in on-the-fly simulation pipelines remains questionable because of either their computational complexity or the quality of the generated RIR filters. In this paper, we propose FRAM-RIR, a fast random approximation method of the widely used image-source method (ISM), to efficiently generate realistic multi-channel RIR filters. FRAM-RIR bypasses the explicit calculation of sound propagation paths in ISM-based algorithms by randomly sampling the location and number of reflections of each virtual sound source based on several heuristic assumptions, while still maintaining accurate direction-of-arrival (DOA) information for all sound sources. Visualization of oracle beampatterns and directional features shows that FRAM-RIR can generate more realistic RIR filters than existing widely used ISM-based tools. Experimental results on multi-channel noisy speech separation and dereverberation tasks with a wide range of neural network architectures show that models trained with FRAM-RIR achieve on-par or better performance on real RIRs compared to other RIR simulation tools, with a significantly accelerated training procedure. A Python implementation of FRAM-RIR is released.
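To make the idea of randomly sampling virtual sources more concrete, the following is a minimal toy sketch, not the released FRAM-RIR implementation. The function name `toy_random_rir`, all of its parameters, the uniform sampling of positions, and the rule tying the number of reflections to distance are illustrative assumptions that do not come from the paper. The sketch only shows the general pattern: virtual sources are drawn at random instead of being enumerated by ISM, and each one is rendered with per-microphone delays so that inter-channel time differences, and hence direction cues, stay consistent.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed constant


def toy_random_rir(src, mics, n_images=200, max_dist=30.0, refl_coef=0.8,
                   sr=16000, seed=0):
    """Toy multi-channel RIR: direct path plus randomly placed virtual sources.

    Hypothetical helper for illustration only; mics are assumed to lie
    within about 1 m of the coordinate origin.
    """
    rng = random.Random(seed)
    length = int(sr * (max_dist + 1.0) / SPEED_OF_SOUND) + 1
    rirs = [[0.0] * length for _ in mics]
    # Shortest source-to-microphone distance, guarded against zero.
    direct = max(min(math.dist(src, m) for m in mics), 1e-3)

    # Each entry is (position, number of reflections); the real source has none.
    images = [(tuple(src), 0)]
    for _ in range(n_images):
        # Random direction on the unit sphere and random distance from the origin.
        z = rng.uniform(-1.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        r = rng.uniform(direct, max_dist)
        s = math.sqrt(1.0 - z * z)
        pos = (r * s * math.cos(phi), r * s * math.sin(phi), r * z)
        # Assumed heuristic: farther virtual sources have been reflected more often.
        n_refl = max(1, round(r / direct))
        images.append((pos, n_refl))

    for pos, n_refl in images:
        for ch, mic in enumerate(mics):
            # Per-microphone distance keeps the inter-channel delays consistent.
            d = math.dist(pos, mic)
            idx = int(round(d * sr / SPEED_OF_SOUND))
            if idx < length:
                # Reflection loss times spherical spreading loss.
                rirs[ch][idx] += (refl_coef ** n_refl) / max(d, 1e-3)
    return rirs


# Usage: a source 2 m from the origin and a two-microphone array 10 cm wide.
rirs = toy_random_rir((2.0, 0.0, 0.0), [(-0.05, 0.0, 0.0), (0.05, 0.0, 0.0)])
```

Because nothing here traces actual wall reflections, the cost depends only on the number of sampled virtual sources and microphones, which is the property that makes this style of approximation attractive for on-the-fly simulation. A realistic generator would additionally need band-limited fractional delays and a sampling scheme matched to the target reverberation time, both of which this sketch omits.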

TS-RIR: Translated synthetic room impulse responses for speech augmentation

IR-GAN: Room Impulse Response Generator for Far-field Speech Recognition

RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios

AV-RIR: Audio-Visual Room Impulse Response Estimation

Efficient learning-based sound propagation for virtual and real-world audio processing applications

FAST-RIR: Fast neural diffuse room impulse response generator

Room Impulse Responses help attackers to evade Deep Fake Detection

A study on more realistic room simulation for far-field keyword spotting

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification

Few-Shot Audio-Visual Learning of Environment Acoustics

Hearing Anything Anywhere

Fast Random Approximation of Multi-channel Room Impulse Response

Intermediate Fine-Tuning Using Imperfect Synthetic Speech for Improving Electrolaryngeal Speech Recognition

Speech Recognition with Augmented Synthesized Speech

Real-time pre-processing for improved feature extraction of noisy speech

Blind estimation of room acoustic parameters from speech signals based on extended model of room impulse response

Room Impulse Response Estimation using Optimal Transport: Simulation-Informed Inference

Compression of room impulse responses for compact storage and fast low-latency convolution

FRA-RIR: Fast Random Approximation of the Image-source Method

Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

Deep Room Impulse Response Completion