Full-text Error Correction for Chinese Speech Recognition with Large Language Model

Zhiyuan Tang, Dong Wang, Shen Huang, Shidong Shang
2024-09-12
Abstract: Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction in full text generated by ASR systems from longer speech recordings, such as transcripts from podcasts, news broadcasts, and meetings. First, we develop a Chinese dataset for full-text error correction, named ChFT, utilizing a pipeline that involves text-to-speech synthesis, ASR, and an error-correction pair extractor. This dataset enables us to correct errors across contexts, including both full-text and segment, and to address a broader range of error types, such as punctuation restoration and inverse text normalization, thus making the correction process comprehensive. Second, we fine-tune a pre-trained LLM on the constructed dataset using a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts based on full-text and segment, considering various output formats, such as directly corrected text and JSON-based error-correction pairs. Through various test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting with different prompts, each presenting its own strengths and weaknesses. This establishes a promising baseline for further research. The dataset is available on the website.
Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses error correction for long texts produced by automatic speech recognition (ASR) systems. Most existing research corrects individual utterances from short recordings, which are the predominant form of data for supervised ASR training. That approach has limitations for long texts such as podcast, news broadcast, and meeting transcripts: it cannot capture the context of the whole conversation or document, and it is computationally expensive. The paper makes three contributions:

1. **Constructing a Chinese full-text error correction dataset (ChFT)**: The dataset is built with a pipeline of text-to-speech synthesis, ASR, and error-correction pair extraction. It supports correction in both full-text and segment contexts and covers a broader range of error types, including punctuation restoration and inverse text normalization.
2. **Using large language models (LLMs) for error correction**: A pre-trained LLM is fine-tuned on ChFT with a variety of prompts and target formats. The prompts operate on either the full text or a segment, and the output is either the directly corrected text or JSON-based error-correction pairs.
3. **Experimental evaluation**: The fine-tuned LLMs are evaluated on homogeneous, up-to-date, and hard test sets. They perform well in the full-text setting, with each prompt showing its own strengths and weaknesses.

Overall, the paper explores the potential of LLMs for full-text error correction and establishes a promising baseline for further research.
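
To make the idea of JSON-based error-correction pairs concrete, the following is a minimal sketch of how such pairs could be extracted from an ASR hypothesis and its reference text, and then applied back to the hypothesis. It is not the paper's extractor, whose details are not given here. It assumes a character-level alignment using Python's standard `difflib.SequenceMatcher` as a stand-in, and the function names and JSON field names (`start`, `error`, `correction`) are hypothetical.

```python
from __future__ import annotations

import json
from difflib import SequenceMatcher


def extract_pairs(hypothesis: str, reference: str) -> list[dict]:
    """Align the ASR hypothesis with the reference at character level and
    return every differing span as an error-correction pair.

    Field names are illustrative assumptions, not the paper's schema.
    """
    matcher = SequenceMatcher(None, hypothesis, reference, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # identical spans need no correction
        pairs.append(
            {
                "start": i1,  # offset of the span in the hypothesis
                "error": hypothesis[i1:i2],  # empty for a pure insertion
                "correction": reference[j1:j2],  # empty for a pure deletion
            }
        )
    return pairs


def apply_pairs(hypothesis: str, pairs: list[dict]) -> str:
    """Apply pairs from right to left so earlier offsets stay valid."""
    text = hypothesis
    for pair in sorted(pairs, key=lambda p: p["start"], reverse=True):
        start = pair["start"]
        end = start + len(pair["error"])
        text = text[:start] + pair["correction"] + text[end:]
    return text


if __name__ == "__main__":
    # Hypothetical example: one substituted character plus missing punctuation.
    asr_output = "今天天汽很好我们去公园"
    reference = "今天天气很好,我们去公园。"
    pairs = extract_pairs(asr_output, reference)
    print(json.dumps(pairs, ensure_ascii=False, indent=2))
    print(apply_pairs(asr_output, pairs))
```

A pair-based target like this is much shorter than the full corrected text, but it only works if each pair can be located unambiguously in the input. The sketch handles that with character offsets. A format without offsets would need some other way to disambiguate repeated spans.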