Abstract: Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark that uses external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for predicting the true transcription. This approach is a paradigm shift from the traditional language model rescoring strategy, which can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of LLM-based error correction techniques with varying amounts of labeled hypotheses-transcription pairs, all of which yield a significant word error rate (WER) reduction. Experimental evidence demonstrates that the proposed technique surpasses the upper bound of traditional re-ranking based methods. More surprisingly, an LLM with a reasonable prompt and its generative capability can even correct tokens that are missing from the N-best list. We make our results publicly accessible for reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs.
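
The re-ranking upper bound mentioned in the abstract can be illustrated with a minimal sketch. It assumes that WER is the word-level edit distance divided by the reference length, and that the upper bound of re-ranking is the lowest WER among the N-best hypotheses (the "oracle"). The function names and the toy sentences are hypothetical and are not taken from the HyPoradise release.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of one hypothesis against a non-empty reference."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def oracle_wer(reference: str, nbest: List[str]) -> float:
    """Best WER any re-ranker could reach: it must pick one listed hypothesis."""
    return min(wer(reference, hyp) for hyp in nbest)


# Toy example (made up for illustration): every hypothesis has an error,
# but each word of the reference appears somewhere in the list.
reference = "the cat sat on the mat"
nbest = [
    "the cat sat on the map",
    "the hat sat on the mat",
    "a cat sat on a mat",
]
# A generative corrector is free to output a sentence that is not in the list.
corrected = "the cat sat on the mat"

top1 = wer(reference, nbest[0])
oracle = oracle_wer(reference, nbest)
generative = wer(reference, corrected)
```

In this toy case a re-ranker cannot do better than `oracle`, because it has to return one of the three hypotheses, whereas a generated transcription that merges evidence across hypotheses can have a lower WER.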

HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
