Abstract: We propose a novel approach to enable the use of large, single-speaker ASR models, such as Whisper, for target-speaker ASR. The key insight of this method is that it is much easier to model relative differences among speakers by learning to condition on frame-level diarization outputs than to learn the space of all speaker embeddings. We find that adding even a single bias term per diarization output type before the first transformer block can transform single-speaker ASR models into target-speaker ASR models. Our target-speaker ASR model can be used for speaker-attributed ASR by producing, in sequence, a transcript for each hypothesized speaker in a diarization output. This simplified model for speaker-attributed ASR, using only a single microphone, outperforms cascades of speech separation and diarization by 11% absolute ORC-WER on the NOTSOFAR-1 dataset.
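
The step of producing one transcript per hypothesized speaker implies that the same diarization output is re-read once per speaker, each time from that speaker's point of view. The sketch below illustrates one way this could look, assuming hard (0/1) speaker activity per frame and four frame label types (silence, target only, non-target only, overlap). The names `FrameLabel`, `frame_labels_for_target`, and `labels_per_speaker` are hypothetical and not taken from the paper, and the paper's actual diarization representation may differ (for example, it may use probabilities rather than hard decisions).

```python
from enum import IntEnum


class FrameLabel(IntEnum):
    """Assumed frame label types, seen from one target speaker's side."""
    SILENCE = 0     # nobody is speaking
    TARGET = 1      # only the target speaker is speaking
    NON_TARGET = 2  # only other speakers are speaking
    OVERLAP = 3     # the target and at least one other speaker are speaking


def frame_labels_for_target(activity, target):
    """Relabel a diarization output relative to one target speaker.

    activity[s][t] is True when speaker s is active in frame t.
    """
    labels = []
    for t in range(len(activity[target])):
        target_on = activity[target][t]
        others_on = any(
            row[t] for s, row in enumerate(activity) if s != target
        )
        if target_on and others_on:
            labels.append(FrameLabel.OVERLAP)
        elif target_on:
            labels.append(FrameLabel.TARGET)
        elif others_on:
            labels.append(FrameLabel.NON_TARGET)
        else:
            labels.append(FrameLabel.SILENCE)
    return labels


def labels_per_speaker(activity):
    """One label sequence per hypothesized speaker, in speaker order."""
    return [frame_labels_for_target(activity, s) for s in range(len(activity))]


# Toy diarization output: 2 speakers, 4 frames (illustrative values only).
activity = [
    [False, True, True, False],  # speaker 0
    [False, False, True, True],  # speaker 1
]
per_speaker = labels_per_speaker(activity)
for speaker, labels in enumerate(per_speaker):
    print(speaker, [label.name for label in labels])
```

In a full system, each label sequence would condition a separate decoding pass of the ASR model over the same audio, yielding one transcript per speaker.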

What problem does this paper attempt to address?

The problem this paper addresses is how to apply large-scale single-speaker automatic speech recognition (ASR) models, such as Whisper, to target-speaker ASR (TS-ASR). Specifically, the researchers propose conditioning the model on frame-level speaker diarization outputs instead of relying on traditional speaker embeddings. The method aims to simplify speech recognition in multi-speaker environments and to improve the handling of overlapping speech and multi-speaker data in real-world scenarios.

### Core problems of the paper:

1. **Challenges in multi-speaker ASR**: Existing large-scale ASR models (such as Whisper) are mainly designed for single-speaker scenarios, while in practice most conversations involve multiple speakers and are usually recorded by one or more microphones. Accurately recognizing each speaker's speech in such an environment is an important challenge.
2. **Limitations of traditional methods**: Traditional TS-ASR methods rely on speaker embeddings or source separation techniques. These methods perform well on simulated datasets, but their performance drops significantly on real-world data. In addition, they usually require a large amount of training data and complex model structures.
3. **Relative difference modeling**: Instead of learning the embedding space of all speakers, the researchers argue that a simpler and more effective approach is to model the relative differences between speakers. Conditioning the model on frame-level speaker diarization outputs makes this goal easier to achieve.

### Solutions proposed in the paper:

- **Diarization-Conditioned Whisper**: The researchers propose a new framework called "Diarization-Conditioned Whisper", which conditions the Whisper model on frame-level speaker diarization information. Specifically, the model applies a different transformation to each time frame according to the speaker situation in that frame (such as silence, target speaker, non-target speaker, or overlapping speech), thereby generating a transcript for a specific speaker.
- **Frame-Level Diarization Dependent Transformations (FDDT)**: To achieve this, the researchers designed frame-level diarization-dependent transformations (FDDT). These transformations adjust the internal representation of the model according to the speaker diarization output, enabling the model to better distinguish different speakers.
- **Experimental verification**: Through experiments on multiple datasets, including NOTSOFAR-1, AMI, and Libri2Mix, the researchers demonstrate the effectiveness of the method. In particular, when handling overlapping speech, the method is significantly better than traditional input masking methods, and it outperforms existing baseline models on some datasets.

### Summary:

The main contribution of this paper is a new, simple, and effective TS-ASR method that achieves multi-speaker speech recognition through simple frame-level transformations, without relying on complex speaker embeddings. This not only simplifies the model structure but also improves robustness to real-world data.
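
The simplest form of the frame-level conditioning described above, a single bias term per diarization output type added before the first transformer block, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses hard frame labels, plain Python lists instead of tensors, and hand-picked bias values, whereas in a trained model the biases would be learned parameters. The function name `apply_bias_conditioning` and the label strings are hypothetical.

```python
# Assumed label set, matching the four frame situations named in the text.
LABELS = ("silence", "target", "non_target", "overlap")


def apply_bias_conditioning(frames, labels, biases):
    """Add a per-label bias vector to each frame's feature vector.

    frames: list of feature vectors, one per time frame
    labels: list of label strings, one per time frame
    biases: dict mapping each label to a bias vector of the feature size
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    conditioned = []
    for features, label in zip(frames, labels):
        bias = biases[label]
        if len(bias) != len(features):
            raise ValueError("bias and feature dimensions must match")
        conditioned.append([x + b for x, b in zip(features, bias)])
    return conditioned


# Toy input: 3 frames with 2 features each (illustrative values only).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
labels = ["silence", "target", "overlap"]

# Hand-picked biases for illustration; a real model would learn these.
biases = {
    "silence": [-1.0, -1.0],
    "target": [0.0, 0.0],
    "non_target": [-0.5, -0.5],
    "overlap": [0.5, 0.0],
}

conditioned = apply_bias_conditioning(frames, labels, biases)
print(conditioned)
```

Because the bias depends only on the frame's label relative to the target speaker, the same audio yields a different conditioned input for each target speaker, which is what lets a single-speaker model be steered toward one speaker at a time.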

Target Speaker ASR with Whisper

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

Implicit Enhancement of Target Speaker in Speaker-Adaptive ASR Through Efficient Joint Optimization

Whisper in Medusa's Ear: Multi-head Efficient Decoding for Transformer-based ASR

Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System

Extending Whisper with prompt tuning to target-speaker ASR

Speaker conditioned acoustic modeling for multi-speaker conversational ASR

Simultaneous Speech Recognition and Speaker Diarization for Monaural Dialogue Recordings with Target-Speaker Acoustic Models

Conformer-based Target-Speaker Automatic Speech Recognition for Single-Channel Audio

Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts

Anatomy of Industrial Scale Multilingual ASR

Resource-Efficient Adaptation of Speech Foundation Models for Multi-Speaker ASR

SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

Efficient Compression of Multitask Multilingual Speech Models

Multi-Channel Multi-Speaker ASR Using Target Speaker's Solo Segment

Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation

Houston we have a Divergence: A Subgroup Performance Analysis of ASR Models

Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens

Target Speech Extraction Based on Blind Source Separation and X-vector-based Speaker Selection Trained with Data Augmentation

One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition

Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR