Crossmodal ASR Error Correction with Discrete Speech Units

Yuanchao Li, Pinzhen Chen, Peter Bell, Catherine Lai
2024-09-13
Abstract: ASR remains unsatisfactory in scenarios where the speaking style diverges from that used to train ASR systems, resulting in erroneous transcripts. To address this, ASR Error Correction (AEC), a post-ASR processing approach, is required. In this work, we tackle an understudied issue: the Low-Resource Out-of-Domain (LROOD) problem, by investigating crossmodal AEC on very limited downstream data with 1-best hypothesis transcription. We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon, shedding light on appropriate training schemes for LROOD data. Moreover, we propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality. Results from multiple corpora and several evaluation metrics demonstrate the feasibility and efficacy of our proposed AEC approach on LROOD data as well as its generalizability and superiority on large-scale data. Finally, a study on speech emotion recognition confirms that our model produces ASR error-robust transcripts suitable for downstream applications.
Audio and Speech Processing, Computation and Language, Sound
What problem does this paper attempt to address?
This paper addresses Automatic Speech Recognition (ASR) Error Correction (AEC) on Low-Resource Out-of-Domain (LROOD) data. Specifically, it focuses on the following aspects:

1. **AEC Performance on Low-Resource, Out-of-Domain Data**: The paper examines how well AEC works on very limited downstream data, especially when that data differs significantly from the data used to train the ASR system. The authors approach this through pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon, which points to an appropriate training scheme for LROOD data.
2. **Application of Discrete Speech Units (DSUs)**: To improve AEC quality, the paper proposes introducing DSUs in the fine-tuning stage to align with and enhance the word embeddings. This helps AEC in particular on speech data with a high error rate.
3. **Crossmodal AEC**: The paper explores how to use audio information to improve AEC, especially when the audio source of the large-scale pre-training data is unavailable. The authors find that discrete acoustic features (such as acoustic word embeddings generated by HuBERT) improve AEC more than continuous acoustic features (such as Mel-spectrograms).
4. **Downstream Task Applications**: The paper also evaluates how AEC-corrected transcripts perform in downstream tasks such as Speech Emotion Recognition (SER). The results show that AEC can significantly improve performance on these tasks.

In summary, the main objective of this paper is to make ASR error correction work on LROOD data, improving AEC quality and the usefulness of the corrected transcripts for downstream tasks by introducing DSUs and crossmodal methods.
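To make the idea of aligning DSUs with word embeddings more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this summary does not say how the DSUs are extracted or how the alignment is done, so the nearest-centroid quantization, the collapsing of repeated units, and the dot-product cross-attention with a residual connection are all assumptions chosen for illustration. Every name (`quantize`, `collapse_repeats`, `cross_attend`, `unit_table`) and every number below is hypothetical toy data.

```python
import math

def quantize(frames, centroids):
    """Map each frame vector to the index of its nearest centroid
    (squared Euclidean distance). Assumed way of obtaining unit IDs."""
    units = []
    for f in frames:
        dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centroids]
        units.append(dists.index(min(dists)))
    return units

def collapse_repeats(units):
    """Merge runs of identical consecutive unit IDs into one unit."""
    out = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(word_embs, unit_embs):
    """Each word embedding (query) attends over the unit embeddings
    (keys and values); the attended context is added back to the word
    embedding as a residual. Returns (enhanced embeddings, weights)."""
    d = len(word_embs[0])
    enhanced, weights = [], []
    for q in word_embs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in unit_embs]
        w = softmax(scores)
        ctx = [sum(w[j] * unit_embs[j][i] for j in range(len(unit_embs)))
               for i in range(d)]
        enhanced.append([qi + ci for qi, ci in zip(q, ctx)])
        weights.append(w)
    return enhanced, weights

# Toy frame-level acoustic features and two hypothetical cluster centroids.
frames = [[0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9], [0.0, 0.1]]
centroids = [[0.0, 0.0], [1.0, 1.0]]

raw_units = quantize(frames, centroids)   # [0, 0, 1, 1, 0]
units = collapse_repeats(raw_units)       # [0, 1, 0]

# Hypothetical embedding table: one vector per unit ID.
unit_table = [[1.0, 0.0], [0.0, 1.0]]
unit_embs = [unit_table[u] for u in units]

# Toy word embeddings for a two-token ASR hypothesis.
word_embs = [[0.5, 0.5], [2.0, 0.0]]
enhanced, weights = cross_attend(word_embs, unit_embs)
```

In this toy setup the second word embedding points along the first axis, so it places more attention weight on the units whose embedding is `[1.0, 0.0]`. A real system would learn the unit embedding table and attention projections during fine-tuning rather than fixing them by hand.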