Abstract: Although great progress has been made in automatic speech recognition (ASR), significant performance degradation is still observed when recognizing multi-talker mixed speech. In this paper, we propose and evaluate several architectures that address this problem under the assumption that only a single channel of the mixed signal is available. Our technique extends permutation invariant training (PIT) by introducing a front-end feature separation module trained with the minimum mean square error (MSE) criterion and a back-end recognition module trained with the minimum cross entropy (CE) criterion. More specifically, during training we compute the average MSE or CE over the whole utterance for each possible utterance-level output-target assignment, pick the assignment with the minimum MSE or CE, and optimize for that assignment. This strategy solves the label permutation problem observed in deep-learning-based multi-talker mixed speech separation and recognition systems. The proposed architectures are evaluated and compared on an artificially mixed AMI dataset with both two- and three-talker mixed speech. The experimental results indicate that, compared with a state-of-the-art single-talker speech recognition system, our proposed architectures reduce the word error rate (WER) by a relative 45.0% and 25.0% across all speakers when their energies are comparable, for two- and three-talker mixed speech, respectively. To our knowledge, this is the first work on single-channel multi-talker mixed speech recognition on the challenging speaker-independent spontaneous large vocabulary continuous speech task.
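The utterance-level assignment search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pit_mse`, the use of NumPy, and the array layout (streams, frames, feature dimension) are assumptions made for illustration only. The sketch covers the MSE case; the CE case would follow the same pattern with a cross-entropy score in place of the squared error.

```python
from itertools import permutations

import numpy as np


def pit_mse(outputs, targets):
    """Return (minimum average MSE, best assignment) over a whole utterance.

    outputs, targets: arrays of shape (S, T, D), assumed here to hold
    S streams, T frames, and D feature dimensions.
    The returned tuple `perm` pairs output stream s with target perm[s].
    """
    num_streams = outputs.shape[0]
    best_loss, best_perm = float("inf"), None
    # Enumerate every possible utterance-level output-target assignment.
    for perm in permutations(range(num_streams)):
        # Average squared error over all streams, frames, and dimensions
        # when output stream s is compared with target stream perm[s].
        loss = float(np.mean((outputs - targets[list(perm)]) ** 2))
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Usage: the outputs are the targets in swapped order, so the swapped
# assignment should be selected.
targets = np.arange(30, dtype=float).reshape(2, 5, 3)
outputs = targets[[1, 0]]
loss, perm = pit_mse(outputs, targets)
```

In a training loop, only the loss for the selected assignment would be backpropagated. The search enumerates all S! assignments, which stays small for the two- and three-talker cases considered here.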


Single-Channel Multi-talker Speech Recognition with Permutation Invariant Training
