Abstract: Speaker-attributed automatic speech recognition (ASR) in multi-speaker environments remains a significant challenge, particularly when systems conditioned on speaker embeddings fail to generalize to unseen speakers. In this work, we propose Diarization-Conditioned Whisper (DiCoW), a novel approach to target-speaker ASR that uses speaker diarization outputs as conditioning information. DiCoW extends the pre-trained Whisper model by integrating diarization labels directly, eliminating reliance on speaker embeddings and reducing the need for extensive speaker-specific training data. Our method introduces frame-level diarization-dependent transformations (FDDT) and query-key biasing (QKb) techniques to refine the model's focus on target speakers while effectively handling overlapping speech. By using diarization outputs as conditioning signals, DiCoW simplifies the workflow for multi-speaker ASR, improves generalization to unseen speakers, and enables more reliable transcription in real-world multi-speaker recordings. Additionally, we explore the integration of a connectionist temporal classification (CTC) head into Whisper and demonstrate its ability to improve transcription efficiency through hybrid decoding. Notably, we show that our approach is not limited to Whisper; it also provides similar benefits when applied to the Branchformer model. We validate DiCoW on real-world datasets, including AMI and NOTSOFAR-1 from the CHiME-8 challenge, as well as synthetic benchmarks such as Libri2Mix and LibriCSS, enabling direct comparisons with previous methods. Results demonstrate that DiCoW enhances the model's target-speaker ASR capabilities while maintaining Whisper's accuracy and robustness on single-speaker data.

DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition
