Advancing Multi-talker ASR Performance with Large Language Models

Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu

2024-08-31

Abstract: Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method for multi-talker ASR: the transcriptions of the individual speakers are concatenated in order of the emission times of their speech, and the result is used as the training target. However, SOT-style transcriptions, derived from concatenating multiple related utterances in a conversation, depend significantly on modeling long contexts. Therefore, compared to traditional methods that primarily emphasize encoder performance in attention-based encoder-decoder (AED) architectures, a novel approach utilizing large language models (LLMs) that leverages the capabilities of pre-trained decoders may be better suited for such complex and challenging scenarios. In this paper, we propose an LLM-based SOT approach for multi-talker ASR that leverages a pre-trained speech encoder and a pre-trained LLM and fine-tunes them on multi-talker datasets using appropriate strategies. Experimental results demonstrate that our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI, outperforming the AED model trained with 1000 times more supervised data in previous works.

Audio and Speech Processing, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the challenge of recognizing overlapping speech in multi-speaker conversations. Automatic speech recognition (ASR) systems perform well in quiet, single-speaker scenarios but degrade considerably when several people talk, especially when their speech overlaps. To tackle this, the paper proposes a serialized output training (SOT) method based on large language models (LLMs) for multi-talker ASR. The method combines a pre-trained speech encoder with a pre-trained LLM and fine-tunes them on multi-talker datasets using appropriate strategies. The motivation is that SOT-style transcriptions concatenate several related utterances from a conversation and therefore depend heavily on long-context modeling, which a pre-trained LLM decoder may handle better than the decoder of a traditional attention-based encoder-decoder (AED) model. Experimental results show that the method surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI. Notably, on AMI it outperforms an AED model from previous work that was trained with 1000 times more supervised data.
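As a rough illustration of the serialization step that SOT relies on, the sketch below builds a single training target from several speakers' transcriptions by ordering them in time and joining them. It is a minimal sketch and not the authors' implementation: the `Utterance` class, the `build_sot_target` function, and the `<sc>` speaker-change token are assumptions introduced here, and the utterance start time is used as a stand-in for the "emission time" mentioned in the abstract.

```python
from dataclasses import dataclass

# Assumed separator marking a change of speaker; this page does not name
# the token the paper actually uses.
SC_TOKEN = "<sc>"


@dataclass
class Utterance:
    """Hypothetical container for one speaker's utterance in a mixture."""
    speaker: str
    start: float  # start time in seconds, used here as the emission time
    text: str


def build_sot_target(utterances, sc_token=SC_TOKEN):
    """Concatenate transcriptions in order of start time.

    The earliest utterance comes first; consecutive utterances are
    separated by the speaker-change token.
    """
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {sc_token} ".join(u.text.strip() for u in ordered)


# Two overlapping utterances, deliberately listed out of time order.
example = [
    Utterance(speaker="B", start=1.2, text="fine thanks"),
    Utterance(speaker="A", start=0.0, text="how are you"),
]
sot_target = build_sot_target(example)
print(sot_target)  # how are you <sc> fine thanks
```

Because the target is one long token sequence covering every speaker in the mixture, the decoder has to track context across utterance boundaries, which is the property the paper cites as motivation for using a pre-trained LLM.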

Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR

Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets

Using Large Language Model for End-to-End Chinese ASR and NER

Adapting Large Language Model with Speech for Fully Formatted End-to-End Speech Recognition

Large Language Models Are Strong Audio-Visual Speech Recognition Learners

An Embarrassingly Simple Approach for LLM with Strong ASR Capacity

Prompting Large Language Models with Speech Recognition Abilities

Adapting Self-Supervised Models to Multi-Talker Speech Recognition Using Speaker Embeddings

Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study

A Transcription Prompt-based Efficient Audio Large Language Model for Robust Speech Recognition

Connecting Speech Encoder and Large Language Model for ASR

Tuning Large language model for End-to-end Speech Translation

BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment

A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings

A Survey on Speech Large Language Models