Advancing Multi-talker ASR Performance with Large Language Models

Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu
2024-08-31
Abstract: Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method to address multi-talker ASR, with the idea of concatenating transcriptions from multiple speakers according to the emission times of their speech for training. However, SOT-style transcriptions, derived from concatenating multiple related utterances in a conversation, depend significantly on modeling long contexts. Therefore, compared to traditional methods that primarily emphasize encoder performance in attention-based encoder-decoder (AED) architectures, a novel approach utilizing large language models (LLMs) that leverages the capabilities of pre-trained decoders may be better suited for such complex and challenging scenarios. In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging a pre-trained speech encoder and LLM and fine-tuning them on multi-talker datasets using appropriate strategies. Experimental results demonstrate that our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI, outperforming the AED model trained with 1000 times more supervised data in previous works.
Audio and Speech Processing, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the recognition of overlapping speech in multi-speaker conversations, a setting where automatic speech recognition (ASR) systems that perform well on quiet, single-speaker audio degrade significantly. It proposes a serialized output training (SOT) method built on large language models (LLMs) for multi-talker ASR. SOT-style transcriptions concatenate several related utterances from a conversation, so they depend heavily on modeling long contexts; the authors argue that a pre-trained LLM decoder is therefore better suited to the task than traditional attention-based encoder-decoder (AED) models, which primarily emphasize encoder performance. The method combines a pre-trained speech encoder with an LLM and fine-tunes them on multi-talker datasets using appropriate strategies. Experimental results show that it surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI, where it outperforms an AED model from previous work that was trained with 1000 times more supervised data.
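The SOT idea described in the abstract, concatenating the transcriptions of multiple speakers according to the emission times of their speech, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Utterance` class, the `build_sot_target` function, and the example sentences are hypothetical, and the `<sc>` speaker-change separator is an assumption borrowed from the usual SOT formulation, since the abstract does not specify which token is used.

```python
from dataclasses import dataclass
from typing import List

# Assumed separator token marking a change of speaker; the abstract does not
# say which token the authors use.
SPEAKER_CHANGE = "<sc>"


@dataclass
class Utterance:
    """One speaker's utterance inside a (possibly overlapping) mixture."""
    speaker: str
    start_time: float  # emission (start) time in seconds
    text: str


def build_sot_target(utterances: List[Utterance], sep: str = SPEAKER_CHANGE) -> str:
    """Concatenate transcriptions in order of start time, joined by `sep`."""
    ordered = sorted(utterances, key=lambda u: u.start_time)
    return f" {sep} ".join(u.text for u in ordered)


# Hypothetical two-speaker mixture where speaker B starts while A is talking.
example = [
    Utterance(speaker="B", start_time=1.2, text="i think so too"),
    Utterance(speaker="A", start_time=0.0, text="shall we start the meeting"),
]
sot_target = build_sot_target(example)
print(sot_target)
```

Because the training target is a single serialized string covering every speaker, the decoder has to model a longer, conversation-level context than in single-talker ASR, which is the motivation the abstract gives for using a pre-trained LLM as the decoder.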