Abstract: ASR remains unsatisfactory in scenarios where the speaking style diverges from that used to train ASR systems, resulting in erroneous transcripts. To address this, ASR Error Correction (AEC), a post-ASR processing approach, is required. In this work, we tackle an understudied issue: the Low-Resource Out-of-Domain (LROOD) problem, by investigating crossmodal AEC on very limited downstream data with 1-best hypothesis transcription. We explore pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon, shedding light on appropriate training schemes for LROOD data. Moreover, we propose the incorporation of discrete speech units to align with and enhance the word embeddings for improving AEC quality. Results from multiple corpora and several evaluation metrics demonstrate the feasibility and efficacy of our proposed AEC approach on LROOD data as well as its generalizability and superiority on large-scale data. Finally, a study on speech emotion recognition confirms that our model produces ASR error-robust transcripts suitable for downstream applications.

What problem does this paper attempt to address?

This paper addresses Automatic Speech Recognition (ASR) Error Correction (AEC) on Low-Resource Out-of-Domain (LROOD) data. Specifically, it focuses on the following aspects:

1. **AEC performance on low-resource, out-of-domain data**: The paper studies how well AEC works with very limited downstream data, especially when that data differs significantly from the data used to train the ASR system. The authors compare pre-training and fine-tuning strategies and uncover an ASR domain discrepancy phenomenon, which informs an appropriate training scheme for LROOD data.
2. **Discrete Speech Units (DSUs)**: To improve AEC quality, the paper proposes introducing DSUs in the fine-tuning stage to align with and enhance the word embeddings. This helps AEC in particular on speech data with a high error rate.
3. **Crossmodal AEC**: The paper explores how to use audio information to improve AEC, especially when the audio source of the large-scale pre-training data is unavailable. The authors find that discrete acoustic features (such as acoustic word embeddings generated by HuBERT) improve AEC more than continuous acoustic features (such as Mel-spectrograms).
4. **Downstream task applications**: The paper also evaluates the corrected transcripts on a downstream task, Speech Emotion Recognition (SER). The results show that AEC can significantly improve performance on such tasks.

In summary, the paper aims to solve ASR error correction on LROOD data, and to improve both AEC quality and the usefulness of the corrected transcripts for downstream tasks by introducing DSUs and crossmodal methods.
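As a rough illustration of the DSU idea, the sketch below shows one common way discrete units can be derived and then used to enhance word embeddings. It is not the paper's implementation. It assumes that frame-level speech features (for example, from HuBERT) are quantized to their nearest k-means centroid, that consecutive repeated units are collapsed, and that word embeddings attend over unit embeddings through a residual cross-attention. The function names `quantize_to_units`, `collapse_repeats`, and `cross_attend`, the random arrays standing in for real features, and all dimensions are hypothetical.

```python
import numpy as np


def quantize_to_units(frames, centroids):
    """Map each frame (T, D) to the index of its nearest centroid (K, D)."""
    # Squared Euclidean distance between every frame and every centroid: (T, K).
    dists = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def collapse_repeats(units):
    """Merge runs of identical consecutive units, e.g. [3, 3, 1] -> [3, 1]."""
    if len(units) == 0:
        return units
    keep = np.concatenate(([True], units[1:] != units[:-1]))
    return units[keep]


def cross_attend(word_emb, unit_emb):
    """Let word embeddings (N, D) attend over unit embeddings (M, D).

    Returns the enhanced word embeddings (residual sum, an assumption of
    this sketch) and the attention weights (N, M).
    """
    d = word_emb.shape[-1]
    scores = word_emb @ unit_emb.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return word_emb + weights @ unit_emb, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n_units = 16, 8

    # Stand-ins for real data (assumptions): 50 speech frames, 8 centroids,
    # a unit embedding table, and embeddings for a 5-word ASR hypothesis.
    frames = rng.normal(size=(50, dim))
    centroids = rng.normal(size=(n_units, dim))
    unit_table = rng.normal(size=(n_units, dim))
    word_emb = rng.normal(size=(5, dim))

    units = collapse_repeats(quantize_to_units(frames, centroids))
    enhanced, weights = cross_attend(word_emb, unit_table[units])
    print("unit sequence length:", len(units))
    print("enhanced word embeddings shape:", enhanced.shape)
```

In a trained system the centroids, the unit embedding table, and the attention projections would be learned; here they are random so that the example stays self-contained.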

Crossmodal ASR Error Correction with Discrete Speech Units

Cross Modal Training for ASR Error Correction with Contrastive Learning

ASR-EC Benchmark: Evaluating Large Language Models on Chinese ASR Error Correction

Speech Emotion Recognition with ASR Transcripts: A Comprehensive Study on Word Error Rate and Fusion Techniques

Towards Decoupling Frontend Enhancement and Backend Recognition in Monaural Robust ASR

UCorrect: An Unsupervised Framework for Automatic Speech Recognition Error Correction

ASR Error Correction using Large Language Models

Unified Cross-Modal Attention: Robust Audio-Visual Speech Recognition and Beyond

MF-AED-AEC: Speech Emotion Recognition by Leveraging Multimodal Fusion, Asr Error Detection, and Asr Error Correction

PCAD: Towards ASR-Robust Spoken Language Understanding via Prototype Calibration and Asymmetric Decoupling

Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding

ASR-Robust Spoken Language Understanding on ASR-GLUE dataset

ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition

Improvement of Acoustic Models Fused with Lip Visual Information for Low-Resource Speech

C²A-SLU: Cross and Contrastive Attention for Improving ASR Robustness in Spoken Language Understanding

On Modular Training of Neural Acoustics-to-Word Model for LVCSR

Cross-utterance ASR Rescoring with Graph-based Label Propagation

Reducing Multilingual Context Confusion for End-to-end Code-switching Automatic Speech Recognition

Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction

Recent Progress in the CUHK Dysarthric Speech Recognition System

Automatic Speech Recognition Post-Processing for Readability: Task, Dataset and a Two-Stage Pre-Trained Approach