On Speaker Attribution with SURT

Desh Raj, Matthew Wiesner, Matthew Maciejewski, Leibny Paola Garcia-Perera, Daniel Povey, Sanjeev Khudanpur
2024-01-28
Abstract: The Streaming Unmixing and Recognition Transducer (SURT) has recently become a popular framework for continuous, streaming, multi-talker speech recognition (ASR). With advances in architecture, objectives, and mixture simulation methods, it was demonstrated that SURT can be an efficient streaming method for speaker-agnostic transcription of real meetings. In this work, we push this framework further by proposing methods to perform speaker-attributed transcription with SURT, for both short mixtures and long recordings. We achieve this by adding an auxiliary speaker branch to SURT, and synchronizing its label prediction with ASR token prediction through HAT-style blank factorization. In order to ensure consistency in relative speaker labels across different utterance groups in a recording, we propose "speaker prefixing" -- appending each chunk with high-confidence frames of speakers identified in previous chunks, to establish the relative order. We perform extensive ablation experiments on synthetic LibriSpeech mixtures to validate our design choices, and demonstrate the efficacy of our final model on the AMI corpus.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper addresses speaker-attributed transcription in multi-talker automatic speech recognition (ASR), that is, determining who spoke which words in addition to what was said. Existing methods are usually divided into modular (pipeline) and end-to-end approaches, but they suffer from error propagation and maintenance complexity. The paper extends the Streaming Unmixing and Recognition Transducer (SURT) framework to perform speaker-attributed transcription for both short mixtures and long recordings. The SURT model consists of two components: separation (unmixing) and recognition. The unmixing component separates mixed audio into non-overlapping streams, and the recognition component transcribes them. To achieve speaker attribution, the paper adds an auxiliary speaker transducer to the recognition module and synchronizes speaker label predictions with ASR token predictions through HAT-style blank factorization. It also proposes "speaker prefixing", which keeps relative speaker labels consistent across segments of a recording by attaching to each segment high-confidence frames of the speakers identified in previous segments. The paper conducts extensive ablation experiments on synthetic LibriSpeech mixtures to validate its design choices and demonstrates the final model on real meeting recordings from the AMI corpus. The reported results indicate that the approach is effective for continuous, streaming, multi-talker ASR with speaker attribution. In conclusion, the goal of the paper is to extend the SURT model to handle any number of speakers without requiring a speaker directory, and it pursues this goal through the auxiliary speaker branch and the speaker prefixing strategy.
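The synchronization idea can be illustrated with a small sketch. In HAT-style blank factorization, the blank probability comes from a sigmoid on a dedicated logit, and the remaining probability mass is spread over the non-blank labels with a softmax. The sketch below assumes that the speaker branch reuses the blank probability of the ASR branch, so that both branches emit or stay silent together; this is one plausible reading of the summary, not the paper's confirmed implementation. The function names (`hat_factorize`, `joint_step`) and the example logits are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hat_factorize(blank_logit, label_logits):
    """HAT-style factorization: blank from a sigmoid, labels share the rest.

    Returns (p_blank, label_probs), where p_blank + sum(label_probs) == 1.
    """
    p_blank = sigmoid(blank_logit)
    label_probs = [(1.0 - p_blank) * p for p in softmax(label_logits)]
    return p_blank, label_probs


def joint_step(blank_logit, token_logits, speaker_logits):
    """One decoding step for both branches (hypothetical helper).

    Assumption: the speaker branch has no blank of its own and instead
    reuses the ASR blank probability, which ties its emissions to the
    ASR token emissions.
    """
    p_blank, token_probs = hat_factorize(blank_logit, token_logits)
    speaker_probs = [(1.0 - p_blank) * p for p in softmax(speaker_logits)]
    return p_blank, token_probs, speaker_probs


# Illustrative values only: three ASR tokens and two relative speaker labels.
p_blank, token_probs, speaker_probs = joint_step(
    blank_logit=0.0,
    token_logits=[2.0, 0.5, -1.0],
    speaker_logits=[1.0, 0.0],
)
```

Because both distributions are scaled by the same `1 - p_blank`, a step that is likely to emit blank for the ASR branch is equally likely to emit no speaker label, which is the sense in which the two predictions stay synchronized in this sketch.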