Target Speaker ASR with Whisper

Alexander Polok, Dominik Klement, Matthew Wiesner, Sanjeev Khudanpur, Jan Černocký, Lukáš Burget
2024-09-15
Abstract: We propose a novel approach to enable the use of large, single-speaker ASR models, such as Whisper, for target-speaker ASR. The key insight of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization output type before the first transformer block can transform single-speaker ASR models into target-speaker ASR models. Our target-speaker ASR model can be used for speaker-attributed ASR by producing, in sequence, a transcript for each hypothesized speaker in a diarization output. This simplified model for speaker-attributed ASR using only a single microphone outperforms cascades of speech separation and diarization by 11% absolute ORC-WER on the NOTSOFAR-1 dataset.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper addresses how to apply large single-speaker automatic speech recognition (ASR) models, such as Whisper, to target-speaker ASR (TS-ASR). Specifically, the researchers propose conditioning the model on frame-level speaker diarization outputs instead of relying on traditional speaker embeddings. The method aims to simplify speech recognition in multi-speaker environments and to improve the handling of overlapping speech and multi-speaker data in real-world scenarios.

### Core problems of the paper:

1. **Challenges in multi-speaker ASR**: Existing large ASR models (such as Whisper) are designed mainly for single-speaker scenarios, while in practice most conversations involve multiple speakers and are usually recorded by one or more microphones. Accurately recognizing each speaker's speech in such an environment is an important challenge.
2. **Limitations of traditional methods**: Traditional TS-ASR methods rely on speaker embeddings or source separation techniques. These methods perform well on simulated datasets, but their performance drops significantly on real-world data. In addition, they usually require a large amount of training data and complex model structures.
3. **Relative difference modeling**: Instead of learning the embedding space of all speakers, the researchers argue that it is simpler and more effective to model the relative differences between speakers. Conditioning the model on frame-level speaker diarization outputs makes this goal easier to achieve.

### Solutions proposed in the paper:

- **Diarization-Conditioned Whisper**: The researchers propose a framework called "Diarization-Conditioned Whisper", which conditions the Whisper model on frame-level speaker diarization information. Specifically, the model applies a different transformation to each time frame according to that frame's diarization state (such as silence, target speaker, non-target speaker, or overlapping speech), and thereby generates a transcript for a specific speaker.
- **Frame-Level Diarization Dependent Transformations (FDDT)**: To achieve this, the researchers designed frame-level diarization-dependent transformations (FDDT). These transformations adjust the model's internal representation according to the speaker diarization output, enabling the model to better distinguish between speakers.
- **Experimental verification**: Experiments on several datasets, including NOTSOFAR-1, AMI, and Libri2Mix, demonstrate the effectiveness of the method. In particular, on overlapping speech it is significantly better than traditional input masking, and it outperforms existing baseline models on some datasets.

### Summary:

The main contribution of this paper is a new, simple, and effective TS-ASR method that achieves multi-speaker speech recognition through simple frame-level transformations, without relying on complex speaker embeddings. This both simplifies the model structure and improves robustness to real-world data.
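
The frame-level conditioning described above can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names (`frame_classes`, `add_class_bias`), the toy feature vectors, and the bias values are all hypothetical, and it covers only the simplest variant mentioned in the abstract, a single bias per diarization output type added before the first transformer block. It also assumes hard 0/1 diarization decisions and treats "overlap" as the target speaker plus at least one other active speaker, which is an assumption about the labeling rather than a detail confirmed by this summary.

```python
# Illustrative sketch only: names, shapes, and values are assumptions,
# not the paper's code.

SILENCE, TARGET, NON_TARGET, OVERLAP = range(4)


def frame_classes(activity, target):
    """Map binary diarization output to one class per frame, relative to `target`.

    activity[s][t] is 1 if speaker s is active in frame t, else 0.
    """
    num_frames = len(activity[0])
    classes = []
    for t in range(num_frames):
        target_on = activity[target][t] == 1
        others_on = any(
            row[t] == 1 for s, row in enumerate(activity) if s != target
        )
        if target_on and others_on:
            classes.append(OVERLAP)
        elif target_on:
            classes.append(TARGET)
        elif others_on:
            classes.append(NON_TARGET)
        else:
            classes.append(SILENCE)
    return classes


def add_class_bias(frames, classes, biases):
    """Add the bias vector of each frame's class to that frame's features."""
    return [
        [x + b for x, b in zip(frame, biases[c])]
        for frame, c in zip(frames, classes)
    ]


# Two speakers, four frames: speaker 0 talks in frames 1-2, speaker 1 in frames 2-3.
activity = [
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
# Toy 2-dimensional features standing in for the input to the first transformer block.
frames = [[1.0, 1.0] for _ in range(4)]
# One bias vector per class; in a trained model these would be learned parameters.
biases = {
    SILENCE: [0.0, 0.0],
    TARGET: [0.5, 0.0],
    NON_TARGET: [-0.5, 0.0],
    OVERLAP: [0.0, 0.5],
}

# Speaker-attributed use: condition on each hypothesized speaker in turn.
for target in range(len(activity)):
    classes = frame_classes(activity, target)
    conditioned = add_class_bias(frames, classes, biases)
    print(target, classes, conditioned)
```

The same audio frames receive different biases depending on which speaker is treated as the target, which is how a single model could produce one transcript per hypothesized speaker. A fuller transformation could replace the per-class bias with a per-class affine map, but that extension is not shown here.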