Abstract: Although significant progress has been made in single-talker automatic speech recognition (ASR), a large performance gap remains between multi-talker and single-talker speech recognition systems. In this article, we propose an enhanced end-to-end monaural multi-talker ASR architecture and training strategy for recognizing overlapped speech. The single-talker end-to-end model is extended to a multi-talker architecture with permutation invariant training (PIT). Several methods are designed to enhance system performance: speaker parallel attention, scheduled sampling, curriculum learning, and knowledge distillation. More specifically, speaker parallel attention replaces the single shared attention module with a separate attention module for each speaker, which improves the model's ability to trace and separate the speakers. Scheduled sampling and curriculum learning are then applied to make the model easier to optimize. Finally, knowledge distillation transfers knowledge from an original single-speaker model to the multi-speaker model within the proposed end-to-end multi-talker ASR structure. The proposed architectures are evaluated and compared on artificially mixed speech datasets generated from the WSJ0 reading corpus. The experiments demonstrate that the proposed architectures significantly improve multi-talker mixed speech recognition. The final system obtains relative improvements of more than 15% in both character error rate (CER) and word error rate (WER) compared to the basic end-to-end multi-talker ASR system.
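As a brief illustration of the PIT criterion mentioned in the abstract, the following minimal sketch picks the assignment of output streams to reference speakers with the lowest total loss. The function name `pit_loss`, the pairwise loss matrix, and averaging over speakers are assumptions made for illustration and are not taken from the article, whose system computes these losses inside an attention-based encoder-decoder.

```python
from itertools import permutations


def pit_loss(pairwise_loss):
    """Return (loss, permutation) under permutation invariant training.

    pairwise_loss[i][j] is the loss of output stream i scored against
    reference speaker j (hypothetical precomputed values).
    """
    n = len(pairwise_loss)
    best_perm, best_total = None, float("inf")
    # Try every assignment of output streams to reference speakers
    # and keep the one with the smallest summed loss.
    for perm in permutations(range(n)):
        total = sum(pairwise_loss[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    # Averaging over speakers is an assumption of this sketch.
    return best_total / n, best_perm


# Hypothetical two-speaker example: stream 0 matches speaker 1 best,
# and stream 1 matches speaker 0 best.
losses = [[2.5, 0.5], [0.25, 3.0]]
loss, perm = pit_loss(losses)
```

The exhaustive search grows factorially with the number of speakers, which is acceptable for the two-speaker mixtures considered here.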

A Pipelined Framework with Serialized Output Training for Overlapping Speech Recognition

End-to-End Overlapped Speech Detection and Speaker Counting with Raw Waveform

Serialized Speech Information Guidance with Overlapped Encoding Separation for Multi-Speaker Automatic Speech Recognition

Streaming Multi-Talker ASR with Token-Level Serialized Output Training

Serialized Output Training by Learned Dominance

Progressive Joint Modeling in Unsupervised Single-Channel Overlapped Speech Recognition

Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings

VADOI: Voice-Activity-Detection Overlapping Inference For End-to-end Long-form Speech Recognition

Advancing Multi-talker ASR Performance with Large Language Models

Audio-visual Recognition of Overlapped speech for the LRS2 dataset

Overlapped speech recognition from a jointly learned multi-channel neural speech extraction and representation

SA-SOT: Speaker-Aware Serialized Output Training for Multi-Talker ASR

Speaker Overlap-aware Neural Diarization for Multi-party Meeting Analysis

Cascaded encoders for fine-tuning ASR models on overlapped speech

BeamTransformer: Microphone Array-based Overlapping Speech Detection

BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR

Target Speaker Extraction for Overlapped Multi-Talker Speaker Verification

End-to-End Multi-Speaker Speech Recognition using Speaker Embeddings and Transfer Learning

Improving End-to-End Single-Channel Multi-Talker Speech Recognition

Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition

Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator