Abstract: This paper presents a practical technique for automatic speech recognition (ASR) across multiple reverberant environments, based on selecting a model matched to the environment. Multiple ASR models are trained with artificial (simulated) room impulse responses (IRs) of different reverberation times, T60(Model), and tested on real room IRs with varying reverberation times, T60(Room). The biggest challenge in applying the method is choosing a suitable artificial room IR model for training the ASR models. In this paper, a generalised statistical IR model with attenuated reverberation after an early reflection period, named the attenuated IR model, is adopted based on three time-domain statistical IR models. The values of its reverberation-attenuation factor and early reflection period that maximise the recognition rate were determined by search. Extensive testing was performed over four real room IR sets (63 IRs in total) with varying T60(Room) values and speaker-to-microphone distances (SMDs). The optimised attenuated IR model achieved the highest recognition rate among the models compared. Specific considerations for the practical use of the method were taken into account, including: (i) the largest T60(Model) step between trained models, so as to minimise the number of models while keeping performance acceptable; (ii) the impact on ASR of selection errors caused by the estimation error of T60(Room); and (iii) the performance over SMD and direct-to-reverberation energy ratio (DRR). It is shown that recognition rates above 80 to 90% are achieved in most cases. One important advantage of the method is that T60(Room) can be estimated either directly from reverberant sound (Takeda et al., 2009; Falk and Chan, 2010; Löllmann et al., 2010) or from an IR measured at any point in the room, since it remains constant within the same room (Kuttruff, 2000). The method is therefore particularly suited to mobile applications. Compared to many classical dereverberation methods, the proposed method is better suited to ASR tasks in multiple reverberant environments, such as human-robot interaction. © 2014 Elsevier B.V. All rights reserved.
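
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a common time-domain statistical form (Gaussian noise under an exponential envelope that falls 60 dB over T60) and models the attenuation as a constant gain applied after the early reflection period. The function names, the sampling rate, the `early_s` and `atten` defaults, and the T60(Model) grid are all hypothetical; the abstract does not state the optimised values.

```python
import numpy as np


def attenuated_ir(t60, fs=16000, early_s=0.05, atten=0.5, seed=0):
    """Synthetic IR: decaying Gaussian noise, scaled by `atten` after `early_s`.

    Assumed form only; `early_s` and `atten` are placeholder values,
    not the optimised ones reported in the paper.
    """
    rng = np.random.default_rng(seed)
    n = int(round(t60 * fs))          # IR length: one T60 worth of samples
    t = np.arange(n) / fs             # time axis in seconds
    # Amplitude envelope that drops by 60 dB (factor 1e-3) at t = t60.
    decay = np.exp(-3.0 * np.log(10.0) * t / t60)
    # Unit gain during the early reflection period, `atten` afterwards.
    gain = np.where(t < early_s, 1.0, atten)
    return rng.standard_normal(n) * decay * gain


def reverberate(speech, ir):
    """Convolve clean speech with an IR, truncated to the input length."""
    return np.convolve(speech, ir)[: len(speech)]


def select_model(t60_room_est, model_t60s):
    """Index of the trained model whose T60(Model) is nearest the estimate."""
    model_t60s = np.asarray(model_t60s, dtype=float)
    return int(np.argmin(np.abs(model_t60s - t60_room_est)))


# Hypothetical T60(Model) grid in seconds, one ASR model per entry.
MODEL_T60S = [0.2, 0.4, 0.6, 0.8, 1.0]

# Training side: one synthetic IR per model, used to reverberate clean speech.
clean = np.random.default_rng(1).standard_normal(16000)  # stand-in for speech
training_sets = [reverberate(clean, attenuated_ir(t60)) for t60 in MODEL_T60S]

# Test side: an estimated T60(Room) picks the model to decode with.
chosen = select_model(0.47, MODEL_T60S)
```

With this grid, an estimate of 0.47 s selects the 0.4 s model. A coarser grid means fewer models to train but a larger worst-case mismatch between T60(Room) and the selected T60(Model), which is the trade-off raised in consideration (i).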

FAST-RIR: Fast neural diffuse room impulse response generator

Fast Random Approximation of Multi-channel Room Impulse Response

Deep Room Impulse Response Completion

IR-GAN: Room Impulse Response Generator for Far-field Speech Recognition

AV-RIR: Audio-Visual Room Impulse Response Estimation

Efficient learning-based sound propagation for virtual and real-world audio processing applications

FRA-RIR: Fast Random Approximation of the Image-source Method

TS-RIR: Translated synthetic room impulse responses for speech augmentation

Neural Acoustic Context Field: Rendering Realistic Room Impulse Response With Neural Fields

RIR-SF: Room Impulse Response Based Spatial Feature for Target Speech Recognition in Multi-Channel Multi-Speaker Scenarios

MESH2IR: Neural Acoustic Impulse Response Generator for Complex 3D Scenes

Computationally-efficient and perceptually-motivated rendering of diffuse reflections in room acoustics simulation

RGI-Net: 3D Room Geometry Inference from Room Impulse Responses With Hidden First-Order Reflections

Hearing Anything Anywhere

Few-Shot Audio-Visual Learning of Environment Acoustics

Robust Speech Recognition In Reverberant Environments By Using An Optimal Synthetic Room Impulse Response Model

Novel View Acoustic Parameter Estimation

Data-driven 3D Room Geometry Inference with a Linear Loudspeaker Array and a Single Microphone

Blind Spatial Impulse Response Generation from Separate Room- and Scene-Specific Information

Acoustic Volume Rendering for Neural Impulse Response Fields

Specular Path Generation and Near-Reflective Diffraction in Interactive Acoustical Simulations