Abstract: Benefiting from massive and diverse data sources, speech foundation models exhibit strong generalization and knowledge transfer capabilities to a wide range of downstream tasks. However, a limitation arises from their exclusive handling of single-speaker speech input, making them ineffective in recognizing multi-speaker overlapped speech, a common occurrence in real-world scenarios. In this study, we delve into the adaptation of speech foundation models to eliminate interfering speakers from overlapping speech and perform target-speaker automatic speech recognition (TS-ASR). Initially, we utilize the Whisper model as the foundation for adaptation and conduct a thorough comparison of its integration with existing target-speaker adaptation techniques. We then propose an innovative model termed Speaker-Querying Whisper (SQ-Whisper), which employs a set number of trainable queries to capture speaker prompts from overlapping speech based on target-speaker enrollment. These prompts serve to steer the model in extracting speaker-specific features and accurately recognizing target-speaker transcriptions. Experimental results demonstrate that our approach effectively adapts the pre-trained speech foundation model to TS-ASR. Compared with the robust TS-HuBERT model, the proposed SQ-Whisper significantly improves performance, yielding up to 15% and 10% relative reductions in word error rates (WERs) on the Libri2Mix and WSJ0-2Mix datasets, respectively. With data augmentation, we establish new state-of-the-art WERs of 14.6% on the Libri2Mix Test set and 4.4% on the WSJ0-2Mix Test set. Furthermore, we evaluate our model on the real-world AMI meeting dataset, which shows consistent improvement over other adaptation methods.

Parameterization of Dominant Spectral Peak Trajectory for Whisper Speech Recognition

Improving Whispered Speech Recognition Performance using Pseudo-whispered based Data Augmentation

End-to-end Whispered Speech Recognition with Frequency-weighted Approaches and Pseudo Whisper Pre-training

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR

Whisper-PMFA: Partial Multi-Scale Feature Aggregation for Speaker Verification using Whisper Models

MC-Whisper: Extending Speech Foundation Models to Multichannel Distant Speech Recognition

Extending Whisper with prompt tuning to target-speaker ASR

A Study on Incorporating Whisper for Robust Speech Assessment

Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition

Leveraging Self-Supervised Models for Automatic Whispered Speech Recognition

A Multitask Training Approach to Enhance Whisper with Open-Vocabulary Keyword Spotting

A Multitask Training Approach to Enhance Whisper with Contextual Biasing and Open-Vocabulary Keyword Spotting

Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification

Efficient Compression of Multitask Multilingual Speech Models

Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts

Wavoice: A mmWave-assisted Noise-resistant Speech Recognition System

M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper

Kid-Whisper: Towards Bridging the Performance Gap in Automatic Speech Recognition for Children VS. Adults

Reconstruction of Pitch for Whisper-to-speech Conversion of Chinese

Improving Whisper's Recognition Performance for Under-Represented Language Kazakh Leveraging Unpaired Speech and Text