End-to-End Architectures for Speech Recognition

Yajie Miao, Florian Metze
DOI: https://doi.org/10.1007/978-3-319-64680-0_13
2017-01-01
Abstract: Automatic speech recognition (ASR) has traditionally integrated ideas from many different domains, such as signal processing (mel-frequency cepstral coefficient features), natural language processing (n-gram language models), or statistics (hidden Markov models). Because of this “compartmentalization,” it is widely accepted that components of an ASR system will largely be optimized individually and in isolation, which will negatively influence overall performance. End-to-end approaches attempt to solve this problem by optimizing components jointly, using a single criterion. This can also reduce the need for human experts to design and build speech recognition systems by painstakingly finding the best combination of several resources, which is still somewhat of a “black art.” This chapter will first discuss several recent deep-learning-based approaches to end-to-end speech recognition. Next, we will present the EESEN framework, which combines connectionist-temporal-classification-based acoustic models with a weighted finite state transducer decoding setup. EESEN achieves state-of-the-art word error rates, while at the same time drastically simplifying the ASR pipeline.