Abstract: A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that uses real recordings, yet it struggles with overseparation and with adapting to long-form audio. We introduce PixIT, a joint approach that combines permutation invariant training (PIT) for speaker diarization (SD) and MixIT for SSep. With the small extra requirement of SD labels, it solves the problem of overseparation and allows stitching local separated sources by leveraging existing work on clustering-based neural SD. We measure the quality of the separated sources by applying automatic speech recognition (ASR) systems to them. PixIT boosts the performance of various ASR systems across two meeting corpora, in terms of both the speaker-attributed and utterance-based word error rates, while not requiring any fine-tuning.

What problem does this paper attempt to address?

This paper addresses two tasks for multi-speaker recordings: speech separation (SSep) and speaker diarization (SD). Supervised speech separation systems rely on synthetic data, which leads to poor generalization to real-world recordings. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that trains on real recordings, but it struggles with over-separation and with adapting to long-form audio. The paper introduces PixIT, a joint training method that combines permutation invariant training (PIT) for speaker diarization with MixIT for speech separation. At the small extra cost of requiring speaker diarization labels, PixIT solves the over-separation problem and makes it possible to stitch locally separated sources together using existing clustering-based neural speaker diarization techniques.

To better model real-world mixtures, PixIT builds mixtures of mixtures (MoMs) with a bounded maximum number of speakers. The model processes both the original mixtures and the MoM, predicting separated sources and the corresponding speaker activities for each. During training, PixIT combines the PIT loss on the original mixtures and on the MoM with the MixIT loss on the MoM. When handling long-form audio, the predicted speaker activations allow the locally separated sources to be stitched together.

The paper evaluates the quality of the separated long-form sources by feeding them to various automatic speech recognition (ASR) systems. It reports improvements for all ASR systems on two meeting corpora (AMI and AliMeeting), especially in terms of speaker-attributed word error rate, without any fine-tuning of the ASR systems. In summary, the paper aims to improve speech separation and speaker diarization on real-world multi-speaker recordings through joint training, with better handling of long-form audio as a result.
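The training objective described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`mse`, `pit_loss`, `mixit_loss`, `pixit_loss`), the use of mean squared error as a stand-in for the actual signal-level and frame-level losses, and the `weight` parameter are all assumptions made for the example. The sketch also assumes the two mixtures have the same length and searches assignments by brute force, which is only practical for a handful of sources.

```python
from itertools import permutations, product


def mse(target, estimate):
    """Mean squared error between two equal-length sequences (stand-in loss)."""
    return sum((t - e) ** 2 for t, e in zip(target, estimate)) / len(target)


def pit_loss(reference, estimate, pair_loss=mse):
    """PIT: score every channel permutation and keep the cheapest one."""
    n = len(reference)
    return min(
        sum(pair_loss(reference[i], estimate[perm[i]]) for i in range(n)) / n
        for perm in permutations(range(n))
    )


def mixit_loss(mix1, mix2, sources, signal_loss=mse):
    """MixIT: assign each estimated source to one of the two original
    mixtures, remix, and keep the assignment whose remixes match best."""
    best = float("inf")
    for assignment in product((0, 1), repeat=len(sources)):
        remixes = [[0.0] * len(mix1), [0.0] * len(mix2)]
        for source, k in zip(sources, assignment):
            remixes[k] = [r + s for r, s in zip(remixes[k], source)]
        loss = signal_loss(mix1, remixes[0]) + signal_loss(mix2, remixes[1])
        best = min(best, loss)
    return best


def pixit_loss(mixit_term, pit_terms, weight=1.0):
    """Hypothetical combination: MixIT term plus a weighted sum of PIT terms."""
    return mixit_term + weight * sum(pit_terms)


# Toy example: three sources estimated from the MoM mix1 + mix2.
mix1 = [1.0, 0.0, 1.0]
mix2 = [0.0, 2.0, 0.0]
sources = [[0.0, 2.0, 0.0], [1.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
separation = mixit_loss(mix1, mix2, sources)

# Speaker activities (speakers x frames); the estimate has its channels swapped.
activity_ref = [[1.0, 0.0], [0.0, 1.0]]
activity_est = [[0.0, 1.0], [1.0, 0.0]]
diarization = pit_loss(activity_ref, activity_est)

# Only one PIT term is shown here; the summary describes PIT terms for the
# original mixtures as well as for the MoM.
total = pixit_loss(separation, [diarization])
```

In the toy example the third estimated source is silent, so it can be assigned to either mixture without changing the remix. The swapped activity channels incur no PIT penalty, since the loss is invariant to the ordering of the outputs.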

PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings
