Abstract: Benefiting from massive and diverse data sources, speech foundation models exhibit strong generalization and knowledge transfer capabilities across a wide range of downstream tasks. However, they are limited to single-speaker speech input, which makes them ineffective at recognizing multi-speaker overlapped speech, a common occurrence in real-world scenarios. In this study, we investigate how to adapt speech foundation models to suppress interfering speakers in overlapping speech and perform target-speaker automatic speech recognition (TS-ASR). First, we use the Whisper model as the foundation for adaptation and conduct a thorough comparison of its integration with existing target-speaker adaptation techniques. We then propose a new model, Speaker-Querying Whisper (SQ-Whisper), which employs a fixed number of trainable queries to capture speaker prompts from overlapping speech, conditioned on the target-speaker enrollment. These prompts steer the model toward extracting speaker-specific features and accurately recognizing the target speaker's transcription. Experimental results demonstrate that our approach effectively adapts the pre-trained speech foundation model to TS-ASR. Compared with the strong TS-HuBERT baseline, the proposed SQ-Whisper significantly improves performance, yielding up to 15% and 10% relative reductions in word error rate (WER) on the Libri2Mix and WSJ0-2Mix datasets, respectively. With data augmentation, we establish new state-of-the-art WERs of 14.6% on the Libri2Mix test set and 4.4% on the WSJ0-2Mix test set. Furthermore, we evaluate our model on the real-world AMI meeting dataset, where it shows consistent improvement over other adaptation methods.
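For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how a word error rate and a relative WER reduction are conventionally computed. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_reduction` are hypothetical, the text is split on whitespace with no normalization, and the numbers in the usage note are illustrative rather than taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: whitespace tokenization, no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction of a new system with respect to a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

As an illustration, a hypothetical baseline WER of 20.0% lowered to 17.0% corresponds to `relative_reduction(20.0, 17.0)`, a 15% relative reduction, even though the absolute difference is only 3 percentage points.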

Who is Speaking to Whom? Learning to Identify Utterance Addressee in Multi-Party Conversations

Who Says What to Whom: A Survey of Multi-Party Conversations

Learning WHO Saying WHAT to WHOM in Multi-Party Conversations

Toward an end-to-end implicit addressee modeling for dialogue disentanglement

Wavoice: A mmWave-Assisted Noise-Resistant Speech Recognition System

To Whom are You Talking? A Deep Learning Model to Endow Social Robots with Addressee Estimation Skills

Addressee Detection Using Facial and Audio Features in Mixed Human–Human and Human–Robot Settings: A Deep Learning Framework

Who Responded to Whom: The Joint Effects of Latent Topics and Discourse in Conversation Structure

Joint Learning for Addressee Selection and Response Generation in Multi-Party Conversation

Do LLMs suffer from Multi-Party Hangover? A Diagnostic Approach to Addressee Recognition and Response Selection in Conversations

Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue

Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System

Deep Learning Based Multi-modal Addressee Recognition in Visual Scenes with Utterances

When Less is More: Using Less Context Information to Generate Better Utterances in Group Conversations

Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization

WASE: Learning When to Attend for Speaker Extraction in Cocktail Party Environments

With a Little Help from my (Linguistic) Friends: Topic Segmentation of Multi-party Casual Conversations

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR