Maximum-a-Posteriori-Based Decoding for End-to-End Acoustic Models

Naoyuki Kanda, Xugang Lu, Hisashi Kawai
DOI: https://doi.org/10.1109/TASLP.2017.2678162
2017-05-01
Abstract: This paper presents a novel decoding framework for acoustic models (AMs) based on end-to-end neural networks (e.g., connectionist temporal classification). The end-to-end training of AMs has recently demonstrated high accuracy and efficiency in automatic speech recognition (ASR). When using the trained AM in decoding, although a language model (LM) is implicitly involved in such an end-to-end AM, it is still essential to integrate an external LM trained with a large text corpus to achieve the best results. While there is no theoretical justification, most studies suggest, on empirical grounds, using a naive interpolation of the end-to-end AM score and the external LM score. In this paper, we propose a more theoretically sound decoding framework derived from a maximization of the posterior probability of a word sequence given an observation. As a consequence of the theory, a subword LM is newly introduced to seamlessly integrate the external LM score with the end-to-end AM score. Our proposed method can be achieved by a small modification of the conventional weighted finite-state transducer-based implementation, without having to heavily increase the graph size. We tested the proposed decoding framework on ASR experiments with the Corpus of the Wall Street Journal and the Corpus of Spontaneous Japanese. The results showed that the proposed framework achieved significant and consistent improvements over the conventional interpolation-based decoding framework.
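The contrast between the two decoding scores can be illustrated with a small sketch. The abstract does not state the scoring formula, so everything below is an assumption: the function names, the weights, and the toy hypotheses are hypothetical. The MAP-style score assumes only that Bayes' rule, P(W|s) = P(s|W)P(W)/P(s), leads to subtracting a subword-level prior before the external word-level LM is added. It is not the authors' implementation.

```python
import math


def interpolation_score(log_p_am, log_p_lm, lm_weight):
    """Conventional score: AM log-score plus a weighted external LM log-score."""
    return log_p_am + lm_weight * log_p_lm


def map_style_score(log_p_am, log_p_lm, log_p_subword, lm_weight, subword_weight):
    """Illustrative MAP-style score (an assumption, not the paper's exact formula):
    the subword LM log-score is subtracted, which corresponds to dividing by the
    subword prior P(s) in Bayes' rule, and the external LM log-score is added."""
    return log_p_am + lm_weight * log_p_lm - subword_weight * log_p_subword


# Hypothetical toy hypotheses:
# (word sequence, log P_AM(s|X), log P_LM(W), log P_subwordLM(s))
hypotheses = [
    ("recognize speech", math.log(0.30), math.log(0.020), math.log(0.010)),
    ("wreck a nice beach", math.log(0.35), math.log(0.001), math.log(0.020)),
]


def best(score_fn):
    """Return the word sequence of the highest-scoring hypothesis."""
    return max(hypotheses, key=score_fn)[0]


best_interp = best(lambda h: interpolation_score(h[1], h[2], lm_weight=0.5))
best_map = best(
    lambda h: map_style_score(h[1], h[2], h[3], lm_weight=0.5, subword_weight=0.5)
)
```

In a real decoder these scores would be accumulated along paths of a weighted finite-state transducer rather than computed per complete hypothesis; the sketch only shows where the extra subword LM term would enter.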