Alignment-Free Training for Transducer-based Multi-Talker ASR

Takafumi Moriya, Shota Horiguchi, Marc Delcroix, Ryo Masumura, Takanori Ashihara, Hiroshi Sato, Kohei Matsuura, Masato Mimura
2024-09-30
Abstract: Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers' transcriptions into a single output stream. The first approach is computationally expensive, particularly due to the need for multiple encoder processing. In contrast, the second approach involves a complex label generation process, requiring accurate timestamps of all words spoken by all speakers in the mixture, obtained from an external ASR system. In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. The target labels are created by appending a prompt token corresponding to each speaker at the beginning of the transcription, reflecting the order of each speaker's appearance in the mixtures. Thus, MT-RNNT-AFT can be trained without relying on accurate alignments, and it can recognize all speakers' speech with just one round of encoder processing. Experiments show that MT-RNNT-AFT achieves performance comparable to that of the state-of-the-art alternatives, while greatly simplifying the training process.
Audio and Speech Processing, Computation and Language, Sound
What problem does this paper attempt to address?
This paper addresses several key problems in multi-talker automatic speech recognition (multi-talker ASR), in particular the challenges of handling overlapping speech. It focuses on the following points:

1. **Computational Complexity**: Conventional multi-talker RNN Transducer (MT-RNNT) methods either use multiple encoder or decoder branches, or serialize the transcriptions of all speakers into a single output stream. The former is computationally expensive; the latter requires accurate timestamps, which are obtained by forced alignment with an external ASR system.
2. **Complexity of the Training Process**: Methods such as token-level serialized output training (tSOT) need accurate word timestamps to generate labels. These are especially hard to obtain for real-world mixed recordings, and low-quality alignments degrade performance.
3. **Dependence on Additional Information**: Existing methods usually rely on rich alignment information, speaker information, or additional encoders, which may be difficult to obtain or too computationally expensive in practical applications.

To solve these problems, the authors propose a new alignment-free training (AFT) scheme, named MT-RNNT-AFT. Its main innovations are:

- **Simplified Label Generation**: A prompt token is added at the beginning of each speaker's transcription to indicate that speaker's order of appearance in the mixed audio, which removes the need for forced alignment by an external ASR system.
- **Shared Encoder Output**: A single round of encoder processing is enough to recognize the speech of all speakers, which greatly reduces the computational cost.
- **Parallel Decoding**: The decoder is run as a batch, so the speech of all speakers can be recognized simultaneously.

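
The label generation idea described above can be illustrated with a minimal sketch. It assumes that each speaker's onset time in the mixture is known (for example, because the mixture was simulated), and the names `SpeakerSegment`, `build_aft_targets`, and the `<spk1>`-style prompt format are hypothetical, chosen only for illustration; the paper's actual token names and data pipeline may differ.

```python
from dataclasses import dataclass


@dataclass
class SpeakerSegment:
    """One speaker's contribution to a mixture (hypothetical container)."""
    speaker_id: str
    onset_sec: float   # assumed known: when this speaker starts in the mixture
    transcript: str


def build_aft_targets(segments, prompt_format="<spk{}>"):
    """Return one target string per speaker, prefixed by an order-of-appearance prompt.

    Only the relative order of speaker onsets is used; no word-level
    timestamps or forced alignments are required.
    """
    ordered = sorted(segments, key=lambda seg: seg.onset_sec)
    return [
        f"{prompt_format.format(rank)} {seg.transcript}"
        for rank, seg in enumerate(ordered, start=1)
    ]


mixture = [
    SpeakerSegment("B", 1.7, "good morning"),
    SpeakerSegment("A", 0.0, "hello there"),
]
targets = build_aft_targets(mixture)
for target in targets:
    print(target)
```

Each target is an ordinary single-speaker label sequence with one extra leading token, so in this sketch the training labels need no serialization of words across speakers.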
The experimental results show that MT-RNNT-AFT achieves performance comparable to existing state-of-the-art methods in both offline and online modes, without relying on additional alignment information, speaker information, or additional encoders. Combining it with self-knowledge distillation (KD) and internal language model estimation (ILME) improves performance further. In summary, the paper aims to simplify the training and decoding processes of multi-talker speech recognition while maintaining high performance and low computational cost.
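
The decoding side (one encoder pass, then one decoder hypothesis per speaker prompt processed as a batch) can be sketched with a toy transducer. Everything below is an assumption made for illustration: the weights are random, the "prediction network" is a one-line recurrence, and `PROMPTS`, `encode`, and `decode_all_speakers` are hypothetical names. It shows only the control flow of batched greedy decoding over a shared encoder output, not the authors' model.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, BLANK = 8, 16, 0
PROMPTS = [1, 2]  # hypothetical ids of the speaker prompt tokens

# Toy random parameters standing in for a trained model.
W_enc = rng.normal(size=(4, DIM))
E_pred = rng.normal(size=(VOCAB, DIM))
W_joint = rng.normal(size=(DIM, VOCAB))

encoder_calls = 0


def encode(features):
    """Toy encoder; counts its calls so the single pass can be checked."""
    global encoder_calls
    encoder_calls += 1
    return np.tanh(features @ W_enc)  # [T, DIM]


def decode_all_speakers(features, prompts=PROMPTS, max_symbols=3):
    """Greedy transducer decoding for all prompts at once."""
    enc = encode(features)                 # one encoder pass, shared by all speakers
    state = E_pred[np.array(prompts)]      # [S, DIM]; each hypothesis starts from its prompt
    hyps = [[] for _ in prompts]
    for frame in enc:
        active = np.ones(len(prompts), dtype=bool)
        for _ in range(max_symbols):
            logits = np.tanh(frame + state) @ W_joint   # joint network, [S, VOCAB]
            tokens = logits.argmax(axis=-1)
            active &= tokens != BLANK      # a blank ends this frame for that hypothesis
            if not active.any():
                break
            for s in np.flatnonzero(active):
                hyps[s].append(int(tokens[s]))
            # Toy prediction-network update, applied only to hypotheses that emitted.
            updated = np.tanh(0.5 * state + E_pred[tokens])
            state = np.where(active[:, None], updated, state)
    return hyps


features = rng.normal(size=(5, 4))         # 5 frames of 4-dim toy features
hyps = decode_all_speakers(features)
print(encoder_calls, [len(h) for h in hyps])
```

Because the hypotheses differ only in their initial prompt token, the per-speaker cost in this sketch is confined to the small prediction and joint computations, while the encoder output is computed once and reused.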