Factorized AED: Factorized Attention-Based Encoder-Decoder for Text-Only Domain Adaptive ASR

Xun Gong, Wei Wang, Hang Shao, Xie Chen, Yanmin Qian
DOI: https://doi.org/10.1109/icassp49357.2023.10095937
2023-01-01
Abstract: End-to-end (E2E) automatic speech recognition (ASR) systems have gained popularity given their simplified architecture and promising results. However, text-only domain adaptation remains a major challenge for E2E systems. Text-to-speech (TTS) based approaches fine-tune ASR models on speech synthesized by an auxiliary TTS model, which increases deployment costs. Language model (LM) fusion based approaches can achieve good performance but are sensitive to interpolation parameters. To factorize out the language component of the AED model, we propose the factorized attention-based encoder-decoder (Factorized AED) model, whose decoder takes as input the posterior probabilities of a jointly trained LM. Moreover, in the context of domain adaptation, the domain-specific LM serves as a plug-and-play component for a well-trained Factorized AED model. In-domain experiments on LibriSpeech and out-of-domain experiments adapting from LibriSpeech to a variety of domains in GigaSpeech are conducted to validate the effectiveness of the proposed methods. Results show 20% / 24% relative word error rate (WER) reduction on the LibriSpeech test sets and 8% to 34% relative WER reduction on the test sets of 8 GigaSpeech target domains, compared to the AED baseline.
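The sensitivity of LM fusion to its interpolation parameter can be illustrated with a minimal sketch of log-linear (shallow) fusion, a common form of LM fusion. This is not the Factorized AED method from the abstract. The function names, the two hypotheses, and their log-probabilities are all hypothetical and chosen only to show how a small change in the LM weight can flip the selected transcript.

```python
# Minimal sketch of shallow (log-linear) LM fusion.
# All names and scores here are illustrative assumptions, not values from the paper.

def shallow_fusion_score(asr_logprob: float, lm_logprob: float, lm_weight: float) -> float:
    """Combine an ASR log-probability with an external LM log-probability."""
    return asr_logprob + lm_weight * lm_logprob


def pick_best(hypotheses: list[dict], lm_weight: float) -> str:
    """Return the text of the hypothesis with the highest fused score."""
    best = max(
        hypotheses,
        key=lambda h: shallow_fusion_score(h["asr"], h["lm"], lm_weight),
    )
    return best["text"]


# Two toy hypotheses: the first is preferred by the ASR model,
# the second by the (hypothetical) target-domain LM.
HYPOTHESES = [
    {"text": "acoustically likely", "asr": -1.0, "lm": -6.0},
    {"text": "domain likely", "asr": -1.5, "lm": -2.0},
]

if __name__ == "__main__":
    for weight in (0.0, 0.1, 0.2, 0.5):
        print(weight, pick_best(HYPOTHESES, weight))
```

With these toy scores the selected hypothesis changes between an LM weight of 0.1 and 0.2, which is the kind of tuning dependence that motivates making the LM a built-in, replaceable component of the decoder.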