Jiawen Kang, Lingwei Meng, Mingyu Cui, Yuejiao Wang, Xixin Wu, Xunying Liu, Helen Meng
Abstract: Multi-talker speech recognition (MTASR) faces unique challenges in disentangling and transcribing overlapping speech. To address these challenges, this paper investigates the role of Connectionist Temporal Classification (CTC) in speaker disentanglement when incorporated with Serialized Output Training (SOT) for MTASR. Our visualization reveals that CTC guides the encoder to represent different speakers in distinct temporal regions of acoustic embeddings. Leveraging this insight, we propose a novel Speaker-Aware CTC (SACTC) training objective, based on the Bayes risk CTC framework. SACTC is a tailored CTC variant for multi-talker scenarios: it explicitly models speaker disentanglement by constraining the encoder to represent different speakers' tokens at specific time frames. When integrated with SOT, the SOT-SACTC model consistently outperforms standard SOT-CTC across various degrees of speech overlap. Specifically, we observe relative word error rate reductions of 10% overall and 15% on low-overlap speech. This work represents an initial exploration of CTC-based enhancements for MTASR tasks, offering a new perspective on speaker disentanglement in multi-talker speech recognition.
What problem does this paper attempt to address?
This paper addresses the key challenge in multi-talker speech recognition (MTASR): how to separate and transcribe each talker's speech when several talkers overlap. Specifically, the paper focuses on the following issues:
1. **Separation and Transcription of Multi-Talker Speech**: Conventional automatic speech recognition (ASR) systems usually handle the speech of a single talker, whereas MTASR must process the speech of multiple talkers simultaneously and produce a separate transcript for each.
2. **Application of CTC in Multi-Talker Scenarios**: Connectionist Temporal Classification (CTC) is a training criterion widely used in sequence-to-sequence tasks such as speech recognition. However, CTC assumes a monotonic alignment between input frames and output tokens, which makes its direct application to overlapping speech non-intuitive. The paper therefore studies how CTC behaves in this setting and how it can be improved.
3. **Proposal of Speaker-Aware CTC (SACTC)**: To overcome the limitations of existing methods in handling multi-talker speech, the paper proposes a new training objective, Speaker-Aware CTC (SACTC). Built on the Bayes risk CTC framework, it explicitly constrains the tokens of different talkers to specific time frames, thereby achieving more effective talker separation.
### Main Contributions of the Paper
- **Revealing the Role of CTC in Multi-Talker Speech Recognition**: Visualization shows that the CTC loss guides the encoder to represent different talkers in distinct temporal regions of the acoustic embedding.
- **Proposing the SACTC Training Objective**: Based on the Bayes risk CTC framework, the paper proposes a CTC variant tailored to multi-talker scenarios. By constraining the encoder to represent the tokens of different talkers at specific time frames, it models speaker separation explicitly.
- **Experimental Verification of the Effectiveness of SACTC**: The SOT-SACTC model outperforms the standard SOT-CTC model across various degrees of speech overlap, with relative word error rate reductions of 10% overall and 15% on low-overlap speech.
### Formula Summary
- **CTC Posterior Probability Calculation**:
\[
P(l|x)=\sum_{\pi\in B^{-1}(l)}p(\pi|x)
\]
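where \(l\) is the label sequence, \(x\) is the input acoustic embedding, and \(\pi\) is a frame-level alignment path. \(B\) is the many-to-one mapping that collapses an alignment path into a label sequence by merging repeated labels and removing blanks, so \(B^{-1}(l)\) is the set of all alignment paths \(\pi\) with \(B(\pi) = l\).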
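The sum over alignment paths can be made concrete with a brute-force sketch. This is only an illustration under stated assumptions: the blank symbol `_`, the three-frame toy probabilities, and the function names `collapse` and `ctc_posterior` are all hypothetical and do not come from the paper. Practical CTC implementations use a dynamic-programming forward algorithm rather than enumerating every path.

```python
import itertools

BLANK = "_"  # hypothetical blank symbol used only in this sketch


def collapse(path):
    """The mapping B: merge repeated symbols, then drop blanks."""
    merged = [sym for sym, _ in itertools.groupby(path)]
    return tuple(sym for sym in merged if sym != BLANK)


def ctc_posterior(label, frame_probs):
    """Sum p(pi|x) over every alignment path pi with B(pi) == label.

    frame_probs holds one dict per frame, mapping symbol -> probability.
    Enumeration is exponential in the number of frames, so this is
    only suitable for toy inputs.
    """
    vocab = list(frame_probs[0].keys())
    total = 0.0
    for path in itertools.product(vocab, repeat=len(frame_probs)):
        if collapse(path) == tuple(label):
            p = 1.0
            for t, sym in enumerate(path):
                p *= frame_probs[t][sym]  # frames are treated as independent
            total += p
    return total


# Toy input: 3 frames over the vocabulary {blank, "a", "b"}; numbers are made up.
frame_probs = [
    {"_": 0.2, "a": 0.7, "b": 0.1},
    {"_": 0.5, "a": 0.3, "b": 0.2},
    {"_": 0.3, "a": 0.1, "b": 0.6},
]
posterior_ab = ctc_posterior(["a", "b"], frame_probs)
print(posterior_ab)
```

For the label sequence `a b`, five paths contribute: `aab`, `abb`, `a_b`, `_ab`, and `ab_`.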
- **SACTC Risk Function**:
\[
r_{sa}(s,t)=\begin{cases}
-\frac{1}{1 + e^{\lambda\left(\frac{t}{T}-b\right)}}&\text{if }s = 1\\
-\frac{1}{1 + e^{-\lambda\left(\frac{t}{T}-b\right)}}&\text{otherwise}
\end{cases}
\]
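where \(s\) is the talker index, \(t\) is the time step, \(T\) is the total number of time steps, \(\lambda\) is the Bayes risk factor, and \(b\) is the proportion of the talker's pronunciation length. The two cases are mirrored sigmoids in \(t/T\) centered at \(b\): for \(s = 1\) the value moves from about \(-1\) at early frames to about \(0\) at late frames, and for the other talkers it moves in the opposite direction.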
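A minimal sketch of the risk function as written above may help show its shape. It evaluates only the formula in this summary. How the paper combines this value with the CTC path probabilities is not shown here, and the settings `T = 100`, `lam = 15.0`, and `b = 0.5` are hypothetical values chosen for illustration.

```python
import math


def sactc_risk(s, t, T, lam, b):
    """Evaluate r_sa(s, t) from the formula above.

    s: talker index (1 for the first talker), t: time step,
    T: total number of time steps, lam: Bayes risk factor,
    b: boundary expressed as a fraction of the sequence length.
    """
    z = lam * (t / T - b)
    if s == 1:
        return -1.0 / (1.0 + math.exp(z))
    return -1.0 / (1.0 + math.exp(-z))


# Hypothetical settings, not taken from the paper.
T, lam, b = 100, 15.0, 0.5
for t in (0, 25, 50, 75, 100):
    r1 = sactc_risk(1, t, T, lam, b)
    r2 = sactc_risk(2, t, T, lam, b)
    print(f"t={t:3d}  talker 1: {r1:.4f}  talker 2: {r2:.4f}")
```

At every time step the two values sum to \(-1\), and both equal \(-0.5\) at \(t/T = b\). A larger \(\lambda\) makes the transition around \(b\) sharper.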
Through these improvements, the paper offers a new perspective on speaker disentanglement in multi-talker speech recognition, with consistent gains over standard SOT-CTC across degrees of speech overlap.