Abstract: Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). T-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a single token stream with $\langle \text{cc}\rangle$ symbols interspersed. However, the use of a naive neural transducer architecture significantly constrained its applicability for text-only adaptation. To overcome this limitation, we propose a novel t-SOT model structure that incorporates the idea of factorized neural transducers (FNT). The proposed method separates a language model (LM) from the transducer's predictor and handles the unnatural token order resulting from the use of $\langle \text{cc}\rangle$ symbols in t-SOT. We achieve this by maintaining multiple hidden states and introducing special handling of the $\langle \text{cc}\rangle$ tokens within the LM. The proposed t-SOT FNT model achieves comparable performance to the original t-SOT model while retaining the ability to reduce word error rate (WER) on both single and multi-talker datasets through text-only adaptation.
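To make the serialized format concrete, the following minimal sketch splits a t-SOT-style token stream back into per-channel transcriptions. It assumes two virtual channels and that each ⟨cc⟩ symbol toggles the active channel. The `<cc>` string, the function name `split_channels`, and the example words are illustrative and are not taken from the paper.

```python
CC = "<cc>"  # stand-in for the ⟨cc⟩ channel-change symbol (illustrative spelling)


def split_channels(tokens, num_channels=2):
    """Split a serialized token stream into per-channel word lists.

    Assumption: every <cc> moves to the next channel in round-robin order,
    which reduces to a simple toggle when num_channels == 2.
    """
    channels = [[] for _ in range(num_channels)]
    current = 0
    for tok in tokens:
        if tok == CC:
            current = (current + 1) % num_channels
        else:
            channels[current].append(tok)
    return channels


# Two overlapping utterances interleaved into one stream (made-up example).
stream = ["hello", "how", CC, "good", CC, "are", "you", CC, "morning"]
channels = split_channels(stream)
print(channels)
```

Read in stream order, the words are "hello how good are you morning", which is not natural text. That unnatural order is what makes a standard LM a poor fit for the serialized sequence.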

What problem does this paper attempt to address?

The paper addresses how to improve multi-speaker automatic speech recognition (ASR) through domain adaptation that uses only text data. Specifically, it proposes a new model structure, t-SOT FNT (Token-level Serialized Output Training with Factorized Neural Transducer), designed to overcome the limitations of existing t-SOT models in text-only domain adaptation.

### Background and Challenges

1. **Challenges of Multi-Speaker ASR**:
   - ASR accuracy degrades significantly on overlapping speech.
   - Existing end-to-end (E2E) ASR models simplify the model structure, but they are difficult to adapt to a new domain using only text data, especially once special tokens (such as ⟨cc⟩) are introduced to distinguish different speakers.
2. **Limitations of t-SOT Models**:
   - t-SOT models separate the speech segments of different speakers by inserting special "channel change" tokens ⟨cc⟩. This breaks the natural word order of the token stream, so a standard language model cannot model the decoding sequence well.
   - As a result, existing t-SOT models are poorly suited to text-only domain adaptation.

### Solution

1. **t-SOT FNT Model**:
   - **Multiple Hidden States Design**: The t-SOT FNT model tracks the word sequences of different speakers by maintaining multiple hidden states. When a ⟨cc⟩ token is detected, the model switches to another hidden state, so each hidden state follows the word sequence of a single speaker and sees it in natural order.
   - **Joint Network Prediction of ⟨cc⟩ Tokens**: The t-SOT FNT model treats ⟨cc⟩ as a special non-lexical token that is predicted by the joint network rather than by the lexical predictor.
   - **Text Data Adaptation**: With this design, the LM component that is separated from the transducer's predictor can be adapted with text-only data, improving the model's performance in specific domains.

### Experimental Results

1. **Performance on Single-Speaker and Multi-Speaker Datasets**:
   - On the AMI and ICSI datasets, the performance of the t-SOT FNT model is comparable to that of the t-SOT CT model, and slightly better in some cases.
   - On general single-speaker ASR datasets, the t-SOT FNT model outperforms the t-SOT CT model, narrowing the WER gap with single-speaker CT models.
2. **Text Data Adaptation Results**:
   - After adaptation with text data from the LibriSpeech training set, the t-SOT FNT model achieved significant WER reductions on both single-speaker and multi-speaker datasets.
   - Specifically, on the LibriSpeechMix and LibriCSS datasets, the t-SOT FNT model achieved relative WER reductions of 8.4% and 7.5%, respectively.

### Conclusion

The proposed t-SOT FNT model addresses the challenge of text-only domain adaptation in multi-speaker ASR. By maintaining multiple hidden states and predicting ⟨cc⟩ tokens in the joint network, the model performs comparably to the original t-SOT model on general ASR tasks and further reduces WER on both single-speaker and multi-speaker datasets through text-only adaptation.
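The multiple-hidden-state idea from the Solution section can be illustrated with a toy sketch. It is not the authors' implementation. The class `MultiStateLM`, its `step` method, and the use of a plain token history as the "hidden state" are hypothetical simplifications, and the sketch assumes two states with ⟨cc⟩ toggling between them. In a real model the state would be the recurrent or attention state of the LM.

```python
CC = "<cc>"  # stand-in for the ⟨cc⟩ channel-change token (illustrative spelling)


class MultiStateLM:
    """Toy LM wrapper that keeps one hidden state per virtual channel.

    Assumption: the hidden state is modeled as the tuple of tokens seen so far
    on that channel, purely to show which context the LM would condition on.
    """

    def __init__(self, num_states=2):
        self.states = [() for _ in range(num_states)]
        self.active = 0

    def step(self, token):
        if token == CC:
            # <cc> is non-lexical: it is never appended to any LM history,
            # it only switches which hidden state is active.
            self.active = (self.active + 1) % len(self.states)
        else:
            self.states[self.active] = self.states[self.active] + (token,)
        # Return the context the LM would use for its next-word prediction.
        return self.states[self.active]


lm = MultiStateLM()
stream = ["hello", "how", CC, "good", CC, "are", "you", CC, "morning"]
contexts = [lm.step(tok) for tok in stream]
for tok, ctx in zip(stream, contexts):
    print(f"{tok!r:10} -> active context {ctx}")
```

Because each state only ever sees the words of one channel in their natural order, an LM of this shape could in principle be adapted on ordinary text that contains no ⟨cc⟩ symbols. This is the property the summary attributes to the t-SOT FNT design.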

t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability
