Modelling human speech recognition in challenging noise maskers using machine learning

Birger Kollmeier, Constantin Spille, Angel Mario Castro Martínez, Stephan D. Ewert, Bernd T. Meyer
DOI: https://doi.org/10.1250/ast.41.94
2020-01-01
Acoustical Science and Technology
Abstract: The advantages and limitations of utilizing automatic speech recognition (ASR) techniques for modelling human speech recognition are investigated for a set of "critical" speech maskers for which many standard models of human speech recognition fail. A deep neural network (DNN)-based ASR system utilizing a closed-set sentence recognition test is used to model the speech recognition threshold (SRT) of normal-hearing listeners for a variety of noise types. The benchmark data from Schubotz et al. (2016) include SRTs measured in conditions with increasing complexity in terms of spectro-temporal modulation (from stationary speech-shaped noise to a single interfering talker). The DNN-based model as proposed in Spille et al. (2018) achieves higher prediction accuracy than the baseline models (i.e., SII, ESII, STOI, and mr-sEPSM), even though it does not require a clean speech reference signal (which most auditory-model-based SRT predictions do require). The most accurate predictions are obtained with multi-condition training with known noise types and ASR features that explicitly account for temporal modulations in noisy sentences. Another advantage of the approach is that the DNN can serve as a valuable analysis tool to uncover signal recognition strategies: for instance, by identifying the most relevant cues for correct classification in modulated noise, it is shown that the DNN is listening in the dips. Finally, we present preliminary data indicating that the word error rate (WER) of the model can be replaced with an estimate of the WER, which does not require the transcript of utterances at test time and therefore eliminates an important limitation of the previous model that prevents it from being used in real-world scenarios.