Abstract: Large Language Models (LLMs) have demonstrated substantial potential for error correction in Automatic Speech Recognition (ASR). However, most research focuses on utterances from short-duration speech recordings, which are the predominant form of speech data for supervised ASR training. This paper investigates the effectiveness of LLMs for error correction in full texts generated by ASR systems from longer speech recordings, such as transcripts of podcasts, news broadcasts, and meetings. First, we develop a Chinese dataset for full-text error correction, named ChFT, using a pipeline that comprises text-to-speech synthesis, ASR, and an error-correction pair extractor. This dataset enables us to correct errors across contexts, at both the full-text and segment levels, and to address a broader range of error types, such as punctuation restoration and inverse text normalization, making the correction process more comprehensive. Second, we fine-tune a pre-trained LLM on the constructed dataset using a diverse set of prompts and target formats, and evaluate its performance on full-text error correction. Specifically, we design prompts based on full text and segments, considering various output formats, such as directly corrected text and JSON-based error-correction pairs. Across several test settings, including homogeneous, up-to-date, and hard test sets, we find that the fine-tuned LLMs perform well in the full-text setting with different prompts, each presenting its own strengths and weaknesses. This establishes a promising baseline for further research. The dataset is available on the website.
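
To make the idea of JSON-based error-correction pairs more concrete, the following is a minimal sketch of how such pairs might be extracted by aligning an ASR hypothesis with its reference text. It is not the paper's extractor: the character-level alignment via Python's `difflib`, the function names `extract_correction_pairs` and `apply_correction_pairs`, and the `start`/`error`/`correction` fields are all assumptions made for illustration, and the actual ChFT format may differ.

```python
import json
from difflib import SequenceMatcher


def extract_correction_pairs(hypothesis: str, reference: str) -> list:
    """Align hypothesis and reference character by character and return
    every differing span as a pair (hypothetical schema, not ChFT's)."""
    matcher = SequenceMatcher(None, hypothesis, reference, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        pairs.append({
            "start": i1,                    # offset in the hypothesis
            "error": hypothesis[i1:i2],     # empty for pure insertions
            "correction": reference[j1:j2], # empty for pure deletions
        })
    return pairs


def apply_correction_pairs(hypothesis: str, pairs: list) -> str:
    """Apply pairs from right to left so earlier offsets stay valid."""
    text = hypothesis
    for pair in sorted(pairs, key=lambda p: p["start"], reverse=True):
        end = pair["start"] + len(pair["error"])
        text = text[:pair["start"]] + pair["correction"] + text[end:]
    return text


if __name__ == "__main__":
    # Toy example: unnormalized year digits and missing final punctuation.
    hyp = "二零二三年的会议纪要"
    ref = "2023年的会议纪要。"
    pairs = extract_correction_pairs(hyp, ref)
    print(json.dumps(pairs, ensure_ascii=False, indent=2))
    print(apply_correction_pairs(hyp, pairs))
```

Because the pairs carry offsets into the hypothesis, applying them reproduces the reference exactly, which gives a simple consistency check between the directly corrected text format and the pair-based format. A pair-based target is typically much shorter than the full corrected text when errors are sparse, which is one plausible reason to compare the two output formats.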

Full-text Error Correction for Chinese Speech Recognition with Large Language Model
