Abstract: This paper describes the specification, design and development phases of two widely used telephone services based on automatic speech recognition. The effort spent on evaluating and tuning these services is discussed in detail.

In developing the first service, which is mainly based on the recognition of “alphanumeric” sequences, a significant part of the work consisted of refining the acoustic models. To increase recognition accuracy, we adopted algorithms and methods previously consolidated on broadcast news transcription tasks. A significant result shows that the use of task-specific context-dependent phone models reduces the word error rate by about 40% relative to context-independent phone models. Note that this result was achieved on a small-vocabulary task, significantly different from those generally used in broadcast news transcription.

We also investigated both unsupervised and supervised training procedures. Moreover, we studied a novel partly supervised technique that allows us to select, in some “optimal” way, the speech material to transcribe manually and use for acoustic model training. A significant result shows that the proposed procedure gives performance close to that obtained with a completely supervised training method.

In the second service, mainly based on phrase spotting, considerable effort was devoted to language model refinement. In particular, several types of rejection networks were studied to detect out-of-vocabulary words for the given task; a major result demonstrates that using rejection networks based on a class trigram language model reduces the word error rate from 36.7% to 11.1%, compared with using a phone loop network. For the latter service, the benefits and related costs of regular grammars, stochastic language models and mixed language models are also reported and discussed.

Finally, note that most of the experiments described in this paper were carried out on field databases collected through the developed services.
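
The abstract reports its results as word error rates and relative reductions. As a minimal sketch of how such figures are conventionally computed, the Python below uses the standard edit-distance definition of word error rate. The abstract does not say which scoring tool the authors used, so the function names `word_error_rate` and `relative_reduction` are hypothetical, chosen for illustration only. Under this definition, going from 36.7% to 11.1% corresponds to a relative reduction of roughly 70%.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Illustrative helper; not the scoring tool used in the paper.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error rate removed by the improved system."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy example: one substitution and one deletion over four reference words.
    print(word_error_rate("a b c d", "a x c"))
    # Figures quoted in the abstract: phone loop vs. class trigram rejection network.
    print(round(relative_reduction(36.7, 11.1), 3))
```

A relative reduction is always measured against the baseline error rate, which is why the "about 40%" figure for the acoustic models and the 36.7% to 11.1% figure for the rejection networks are not directly comparable as absolute numbers.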

Design and evaluation of acoustic and language models for large scale telephone services
