Abstract: Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and transcribing overlapping speech. To address these challenges, this paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for MTASR. Our visualization reveals that CTC guides the encoder to represent different speakers in distinct temporal regions of acoustic embeddings. Leveraging this insight, we propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework. SACTC is a tailored CTC variant for multi-talker scenarios that explicitly models speaker disentanglement by constraining the encoder to represent different speakers' tokens at specific time frames. When integrated with SOT, the SOT-SACTC model consistently outperforms standard SOT-CTC across various degrees of speech overlap. Specifically, we observe relative word error rate reductions of 10% overall and 15% on low-overlap speech. This work represents an initial exploration of CTC-based enhancements for MTASR tasks, offering a new perspective on speaker disentanglement in multi-talker speech recognition.

What problem does this paper attempt to address?

This paper addresses a key challenge in multi-talker speech recognition (MTASR): how to separate and transcribe each talker's speech when several talkers overlap. Specifically, it focuses on the following issues:

1. **Separation and Transcription of Multi-Talker Speech**: Traditional automatic speech recognition (ASR) systems usually handle the speech of a single talker, while MTASR must process speech from multiple talkers simultaneously and transcribe each talker separately.
2. **Application of CTC in Multi-Talker Scenarios**: Connectionist Temporal Classification (CTC) is a training criterion widely used in sequence-to-sequence tasks such as speech recognition. However, the monotonicity assumption of CTC makes its direct application to multi-talker scenarios non-intuitive, so the paper studies how CTC behaves in these scenarios and how it can be improved.
3. **Proposal of Speaker-Aware CTC (SACTC)**: To overcome the limitations of existing methods in handling multi-talker speech, the paper proposes a new training objective, Speaker-Aware CTC (SACTC). Building on the Bayes risk CTC framework, it explicitly models the representations of different talkers at specific time frames, thereby achieving more effective talker separation.

### Main Contributions of the Paper

- **Revealing the Role of CTC in Multi-Talker Speech Recognition**: The visualization shows that the CTC loss guides the encoder to represent different talkers in different temporal regions of the acoustic embedding.
- **Proposing the SACTC Training Objective**: Based on the Bayes risk CTC framework, a CTC variant specifically for multi-talker scenarios is proposed. By constraining the encoder to represent the tokens of different talkers at specific time frames, it models speaker separation explicitly.
- **Experimental Verification of the Effectiveness of SACTC**: Experimental results show that the SOT-SACTC model outperforms the standard SOT-CTC model under various degrees of speech overlap. The relative word error rate reduction is 10% overall and 15% on low-overlap speech.

### Formula Summary

- **CTC Posterior Probability Calculation**:

\[ P(l|x)=\sum_{\pi\in B^{-1}(l)}p(\pi|x) \]

where \(l\) is the label sequence, \(x\) is the input acoustic embedding, and \(B\) maps an alignment path \(\pi\) to a label sequence, so \(B^{-1}(l)\) is the set of paths with \(B(\pi) = l\).

- **SACTC Risk Function**:

\[ r_{sa}(s,t)=\begin{cases} -\frac{1}{1 + e^{\lambda\left(\frac{t}{T}-b\right)}}&\text{if }s = 1\\ -\frac{1}{1 + e^{-\lambda\left(\frac{t}{T}-b\right)}}&\text{otherwise} \end{cases} \]

where \(s\) is the talker index, \(t\) is the time step, \(T\) is the total number of time steps, \(\lambda\) is the Bayes risk factor, and \(b\) is the proportion of the talker's pronunciation length.

Through these improvements, the paper provides a new perspective and solution for multi-talker speech recognition, showing its largest gains on low-overlap speech.
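
The two formulas in the summary above can be checked numerically on a toy example. The following is a minimal Python sketch, not the paper's implementation: the function names (`collapse`, `ctc_posterior_bruteforce`, `sactc_risk`), the choice of index 0 as the blank symbol, and the parameter values in the usage lines are all illustrative assumptions. The posterior is computed by brute-force enumeration of every alignment path, which is only feasible for tiny inputs; practical CTC implementations use a forward-backward recursion instead.

```python
import itertools
import math

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def collapse(path):
    """CTC mapping B: merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return tuple(out)


def ctc_posterior_bruteforce(probs, label):
    """P(l|x): sum of p(pi|x) over all paths pi with B(pi) == l.

    probs[t][v] is the per-frame probability of symbol v at frame t.
    Enumerates V**T paths, so use it only on toy inputs.
    """
    num_frames = len(probs)
    vocab_size = len(probs[0])
    target = tuple(label)
    total = 0.0
    for path in itertools.product(range(vocab_size), repeat=num_frames):
        if collapse(path) == target:
            p = 1.0
            for t, sym in enumerate(path):
                p *= probs[t][sym]
            total += p
    return total


def sactc_risk(s, t, T, lam, b):
    """r_sa(s, t) exactly as written in the formula summary.

    s: talker index (s == 1 selects the first case)
    t: time step, T: total number of time steps
    lam: Bayes risk factor, b: proportion of the talker's length
    """
    z = lam * (t / T - b)
    if s == 1:
        return -1.0 / (1.0 + math.exp(z))
    return -1.0 / (1.0 + math.exp(-z))


# Toy usage: 2 frames, vocabulary {blank, 1}, uniform frame probabilities.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_posterior_bruteforce(uniform, [1]))  # paths (0,1), (1,0), (1,1) -> 0.75

# Risk values for both cases at a few time steps (lam and b are illustrative).
for t in (0, 50, 100):
    print(t, sactc_risk(1, t, 100, 15.0, 0.5), sactc_risk(2, t, 100, 15.0, 0.5))
```

As written, the two cases are mirror-image sigmoids: for \(s = 1\) the value is close to \(-1\) when \(t/T < b\) and close to \(0\) afterwards, the other case is the reverse, and both equal \(-0.5\) at \(t/T = b\). A larger \(\lambda\) makes the transition around \(b\) sharper.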

Disentangling Speakers in Multi-Talker Speech Recognition with Speaker-Aware CTC
