PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings

Joonas Kalda, Clément Pagés, Ricard Marxer, Tanel Alumäe, Hervé Bredin
2024-03-05
Abstract: A major drawback of supervised speech separation (SSep) systems is their reliance on synthetic data, leading to poor real-world generalization. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that uses real recordings, yet struggles with overseparation and adapting to long-form audio. We introduce PixIT, a joint approach that combines permutation invariant training (PIT) for speaker diarization (SD) and MixIT for SSep. With a small extra requirement of needing SD labels, it solves the problem of overseparation and allows stitching local separated sources leveraging existing work on clustering-based neural SD. We measure the quality of the separated sources via applying automatic speech recognition (ASR) systems to them. PixIT boosts the performance of various ASR systems across two meeting corpora both in terms of the speaker-attributed and utterance-based word error rates while not requiring any fine-tuning.
Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses two linked tasks in multi-speaker recordings: speech separation (SSep) and speaker diarization (SD). Supervised speech separation systems rely on synthetic data, which leads to poor generalization to real-world recordings. Mixture invariant training (MixIT) was proposed as an unsupervised alternative that trains on real recordings, but it suffers from over-separation and is hard to adapt to long-form audio.

The paper introduces PixIT, a joint training method that combines permutation invariant training (PIT) for speaker diarization with MixIT for speech separation. At the small extra cost of requiring speaker diarization labels, PixIT solves the over-separation problem and builds on existing clustering-based neural speaker diarization to stitch locally separated sources together. To model real-world mixtures, PixIT creates mixtures of mixtures (MoMs) whose maximum number of speakers is bounded. It processes both the original mixtures and the MoMs, predicting separated sources and the corresponding speaker activities. During training, PixIT combines the PIT loss on the original mixtures and on the MoMs with the MixIT loss on the MoMs. On long-form audio, the predicted speaker activations are what allow the separated sources to be stitched together.

The paper evaluates the quality of the separated long-form sources by applying various automatic speech recognition (ASR) systems to them. It reports improvements for these ASR systems on two meeting corpora (AMI and AliMeeting), in terms of both speaker-attributed and utterance-based word error rate, without any fine-tuning of the ASR systems. In summary, the paper aims to improve speech separation and speaker diarization on real-world multi-speaker recordings through joint training, with better handling of long-form audio as a result.
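The training objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes mean squared error as the signal-level loss for MixIT, binary cross-entropy as the frame-level loss for PIT, and a hypothetical `separation_weight` for combining them; the function names and array shapes are also chosen here for illustration.

```python
import itertools

import numpy as np


def mixit_loss(est_sources, mix1, mix2):
    """MixIT sketch: est_sources has shape (num_sources, num_samples)."""
    best = np.inf
    # Each estimated source is assigned to exactly one of the two mixtures.
    for assignment in itertools.product([0, 1], repeat=est_sources.shape[0]):
        mask = np.array(assignment, dtype=bool)
        remix1 = est_sources[~mask].sum(axis=0)
        remix2 = est_sources[mask].sum(axis=0)
        # Assumption: MSE stands in for the signal-level loss.
        loss = np.mean((remix1 - mix1) ** 2) + np.mean((remix2 - mix2) ** 2)
        best = min(best, loss)
    return best


def pit_loss(pred, target, eps=1e-7):
    """PIT sketch: pred and target have shape (num_frames, num_speakers)."""
    pred = np.clip(pred, eps, 1 - eps)
    best = np.inf
    # Try every ordering of the predicted speakers and keep the best match.
    for perm in itertools.permutations(range(target.shape[1])):
        p = pred[:, list(perm)]
        bce = -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))
        best = min(best, bce)
    return best


def joint_loss(est_sources, mix1, mix2, activity_preds, activity_labels,
               separation_weight=0.5):
    """Hypothetical combination: PIT on mixture 1, mixture 2 and the MoM,
    plus a weighted MixIT term on the MoM."""
    pit_total = sum(pit_loss(p, t) for p, t in zip(activity_preds, activity_labels))
    return separation_weight * mixit_loss(est_sources, mix1, mix2) + pit_total
```

Both searches are exhaustive, so their cost grows exponentially (MixIT) or factorially (PIT) with the number of outputs. That is workable only because the number of sources per chunk is kept small.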