t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability

Jian Wu, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao, Zhuo Chen, Jinyu Li
2023-09-15
Abstract: Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). T-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a single token stream with $\langle \text{cc}\rangle$ symbols interspersed. However, the use of a naive neural transducer architecture significantly constrained its applicability for text-only adaptation. To overcome this limitation, we propose a novel t-SOT model structure that incorporates the idea of factorized neural transducers (FNT). The proposed method separates a language model (LM) from the transducer's predictor and handles the unnatural token order resulting from the use of $\langle \text{cc}\rangle$ symbols in t-SOT. We achieve this by maintaining multiple hidden states and introducing special handling of the $\langle \text{cc}\rangle$ tokens within the LM. The proposed t-SOT FNT model achieves comparable performance to the original t-SOT model while retaining the ability to reduce word error rate (WER) on both single and multi-talker datasets through text-only adaptation.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
The paper addresses how to improve multi-speaker automatic speech recognition (ASR) through domain adaptation using only text data. Specifically, it proposes a new model structure, t-SOT FNT (Token-level Serialized Output Training with Factorized Neural Transducer), aimed at overcoming the limitations of existing t-SOT models when adapting to a new domain with text-only data.

### Background and Challenges

1. **Challenges of Multi-Speaker ASR**:
   - The performance of multi-speaker ASR declines significantly when dealing with overlapping speech.
   - Although existing end-to-end (E2E) ASR models simplify the model structure, they are difficult to adapt to a new domain using only text data, especially when special tokens (such as ⟨cc⟩) are introduced to distinguish different speakers.
2. **Limitations of t-SOT Models**:
   - t-SOT models distinguish speech segments of different speakers by inserting special "channel change" tokens ⟨cc⟩, but this disrupts the natural word order, making it difficult for a standard language model to handle the decoding sequence.
   - As a result, existing t-SOT models perform poorly in domain adaptation using only text data.

### Solution

1. **t-SOT FNT Model**:
   - **Multiple Hidden States Design**: The t-SOT FNT model handles the word sequences of different speakers by maintaining multiple hidden states. When a ⟨cc⟩ token is detected, the model switches to another hidden state, so that each hidden state preserves the semantic continuity of a single speaker's words.
   - **Joint Network Prediction of ⟨cc⟩ Tokens**: The t-SOT FNT model treats ⟨cc⟩ tokens as special non-lexical tokens, predicted by a joint network rather than by a lexical predictor.
   - **Text Data Adaptation**: Through the above design, the t-SOT FNT model can effectively utilize external language models (LM) for text data adaptation, thereby improving the model's performance in specific domains.

### Experimental Results

1. **Performance on Single-Speaker and Multi-Speaker Datasets**:
   - On the AMI and ICSI datasets, the performance of the t-SOT FNT model is comparable to that of the t-SOT CT model, and slightly better in some cases.
   - On general single-speaker ASR datasets, the t-SOT FNT model outperforms the t-SOT CT model, narrowing the WER gap with single-speaker CT models.
2. **Text Data Adaptation Results**:
   - After adaptation using text data from the LibriSpeech training set, the t-SOT FNT model achieved significant WER reductions on both single-speaker and multi-speaker datasets.
   - Specifically, on the LibriSpeechMix and LibriCSS datasets, the t-SOT FNT model achieved relative WER reductions of 8.4% and 7.5%, respectively.

### Conclusion

The proposed t-SOT FNT model addresses the challenge of domain adaptation using only text data in multi-speaker ASR. Through the design of multiple hidden states and joint network prediction of ⟨cc⟩ tokens, the model improves performance on both single-speaker and multi-speaker datasets. Experimental results show that the t-SOT FNT model performs well in general ASR tasks and further improves recognition accuracy in specific domains through text data adaptation.
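
The multiple-hidden-state idea from the summary can be illustrated with a toy sketch. This is not the paper's implementation: the names `ToyLMState`, `toy_lm_step`, and `run_multi_state_lm` are hypothetical, the "hidden state" is just a token history rather than a neural LM state, and the use of two channels is an assumption made for illustration. The sketch only shows the routing logic: ⟨cc⟩ is never fed to the LM and only switches which state is active, so each state sees an uninterrupted word sequence.

```python
from dataclasses import dataclass, field

CC = "<cc>"  # stand-in for the channel-change token


@dataclass
class ToyLMState:
    """Hypothetical stand-in for an LM hidden state: the token history so far."""
    history: list = field(default_factory=list)


def toy_lm_step(state: ToyLMState, token: str) -> ToyLMState:
    """Consume one lexical token and return the updated state.

    A real LM would update a neural hidden state here; this toy version
    only appends the token to the history.
    """
    return ToyLMState(history=state.history + [token])


def run_multi_state_lm(tokens, num_channels=2):
    """Route a serialized token stream through per-channel LM states.

    num_channels=2 is an assumption for this sketch.
    """
    states = [ToyLMState() for _ in range(num_channels)]
    active = 0  # index of the currently active hidden state
    for tok in tokens:
        if tok == CC:
            # <cc> is not a lexical token: it bypasses the LM and only
            # switches the active hidden state.
            active = (active + 1) % num_channels
        else:
            states[active] = toy_lm_step(states[active], tok)
    return [s.history for s in states]


# Two overlapping utterances serialized into a single token stream.
stream = ["hello", CC, "good", CC, "world", CC, "morning"]
channels = run_multi_state_lm(stream)
print(channels)  # [['hello', 'world'], ['good', 'morning']]
```

Because each state only ever receives one channel's words in their natural order, an LM trained or adapted on ordinary text could in principle be used for each state, which is the property that makes text-only adaptation plausible in this design.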