Abstract: Extending the RNN Transducer (RNNT) to recognize multi-talker speech is essential for wider automatic speech recognition (ASR) applications. Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation. MT-RNNT is conventionally implemented using architectures with multiple encoders or decoders, or by serializing all speakers' transcriptions into a single output stream. The first approach is computationally expensive, particularly because it requires multiple encoder passes. In contrast, the second approach involves a complex label generation process, requiring accurate timestamps of all words spoken by all speakers in the mixture, obtained from an external ASR system. In this paper, we propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture. The target labels are created by prepending a prompt token corresponding to each speaker to the transcription, reflecting the order of each speaker's appearance in the mixtures. Thus, MT-RNNT-AFT can be trained without relying on accurate alignments, and it can recognize all speakers' speech with just one round of encoder processing. Experiments show that MT-RNNT-AFT achieves performance comparable to that of the state-of-the-art alternatives, while greatly simplifying the training process.
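
The label construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `build_aft_targets`, the prompt format `<spk1>`, `<spk2>`, and the use of per-speaker start times to establish the order of appearance are all assumptions made for illustration. The sketch also assumes one target sequence per speaker.

```python
from typing import List, Sequence, Tuple


def build_aft_targets(
    speakers: Sequence[Tuple[float, List[str]]],
    prompt_format: str = "<spk{}>",  # hypothetical prompt token naming
) -> List[List[str]]:
    """Return one target sequence per speaker, each led by an order-based prompt.

    `speakers` holds (start_time_in_seconds, transcription_tokens) pairs.
    Only the coarse start order is used; no word-level timestamps are needed.
    """
    # Sort by when each speaker first appears in the mixture.
    ordered = sorted(speakers, key=lambda item: item[0])
    targets = []
    for index, (_, tokens) in enumerate(ordered, start=1):
        # Prepend the prompt token that encodes the order of appearance.
        targets.append([prompt_format.format(index)] + list(tokens))
    return targets


mixture = [
    (1.2, ["good", "morning"]),  # this speaker starts second
    (0.0, ["hello", "world"]),   # this speaker starts first
]
targets = build_aft_targets(mixture)
print(targets)
```

The point of the sketch is that the only timing information consumed is which speaker starts first, which is much coarser than the word-level timestamps required by serialized-output labels.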

What problem does this paper attempt to address?

This paper addresses several key problems in multi-talker automatic speech recognition (ASR), especially the challenges of handling overlapping speech. It focuses on the following points:

1. **Computational Complexity**: Conventional multi-talker Recurrent Neural Network Transducer (MT-RNNT) methods either use multiple encoder or decoder branches, or serialize the transcriptions of all speakers into a single output stream. The former has a high computational cost, and the latter requires accurate timestamps, which rely on forced alignment by an external ASR system.
2. **Complexity of the Training Process**: Methods such as token-level serialized output training (tSOT) require accurate timestamps to generate labels. This is especially challenging for real-world mixed recordings, and low-quality alignments degrade performance.
3. **Dependence on Additional Information**: Existing methods usually rely on rich alignment information, speaker information, or additional encoders, which may be difficult to obtain or too computationally expensive in practical applications.

To solve these problems, the authors propose a new alignment-free training (AFT) scheme, named MT-RNNT-AFT. The main innovations of this method include:

- **Simplified Label Generation**: A prompt token is added at the beginning of each speaker's transcription to indicate the order in which that speaker appears in the mixed audio, which avoids the need for forced alignment by an external ASR system.
- **Shared Encoder Output**: Only one round of encoder processing is required to recognize the speech of all speakers, greatly reducing the computational cost.
- **Parallel Decoding**: The decoder is run as a batch, so the speech of all speakers can be recognized simultaneously.

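
The shared-encoder and parallel-decoding points above can be sketched with a toy greedy transducer search. This is a simplified illustration under stated assumptions, not the paper's implementation: `greedy_decode`, `recognize_all`, the `<blank>` symbol, the `<spk1>`/`<spk2>` prompts, and the toy encoder and joint functions are all hypothetical. It also assumes that the decoder's label history is initialized with the speaker prompt, and it loops over prompts where a real system would batch them through the decoder.

```python
from typing import Any, Callable, Dict, List, Sequence

BLANK = "<blank>"  # hypothetical blank symbol


def greedy_decode(enc_out: Sequence[Any], prompt: str,
                  joint: Callable[[Any, List[str]], str],
                  max_symbols_per_frame: int = 5) -> List[str]:
    """Greedy transducer search whose label history starts with a speaker prompt."""
    history = [prompt]
    for frame in enc_out:
        # Emit labels for this frame until blank, then advance to the next frame.
        for _ in range(max_symbols_per_frame):
            token = joint(frame, history)
            if token == BLANK:
                break
            history.append(token)
    return history[1:]  # drop the prompt from the hypothesis


def recognize_all(features: Sequence[Any],
                  encode: Callable[[Sequence[Any]], Sequence[Any]],
                  joint: Callable[[Any, List[str]], str],
                  prompts: Sequence[str]) -> Dict[str, List[str]]:
    enc_out = encode(features)  # the encoder runs once for the whole mixture
    # The same encoder output is reused for every speaker prompt.
    return {p: greedy_decode(enc_out, p, joint) for p in prompts}


# Toy stand-ins for the networks (assumptions for demonstration only).
calls = {"encode": 0}


def toy_encode(features):
    calls["encode"] += 1
    return list(features)


def toy_joint(frame, history):
    # Each frame maps a prompt to (token, position in that speaker's output).
    entry = frame.get(history[0])
    if entry is not None and entry[1] == len(history) - 1:
        return entry[0]
    return BLANK


toy_features = [
    {"<spk1>": ("hello", 0)},
    {"<spk1>": ("world", 1), "<spk2>": ("good", 0)},  # overlapped frame
    {"<spk2>": ("morning", 1)},
]
hypotheses = recognize_all(toy_features, toy_encode, toy_joint, ["<spk1>", "<spk2>"])
print(hypotheses, calls)
```

In the toy run, both speakers are decoded from a single call to `toy_encode`, which is the property the shared-encoder design relies on.
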
The experimental results show that MT-RNNT-AFT achieves performance comparable to existing state-of-the-art methods in both offline and online modes, without relying on additional alignment information, speaker information, or additional encoders. In addition, combining it with self-knowledge distillation (KD) and internal language model estimation (ILME) further improves the model's performance. In summary, this paper aims to simplify the training and decoding processes of multi-talker speech recognition while maintaining high performance and low computational cost.

Alignment-Free Training for Transducer-based Multi-Talker ASR
