Online speaker diarization of meetings guided by speech separation

Elio Gruttadauria, Mathieu Fontaine, Slim Essid

2024-01-30

Abstract: Overlapped speech is notoriously problematic for speaker diarization systems. Consequently, the use of speech separation has recently been proposed to improve their performance. Although promising, speech separation models struggle with realistic data because they are trained on simulated mixtures with a fixed number of speakers. In this work, we introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings with a variable number of speakers, as present in the AMI corpus. We envisage ConvTasNet and DPRNN as alternatives for the separation networks, with two or three output sources. To obtain the speaker diarization result, voice activity detection is applied on each estimated source. The final model is fine-tuned end-to-end, after first adapting the separation to real data using AMI. The system operates on short segments, and inference is performed by stitching the local predictions using speaker embeddings and incremental clustering. The results show that our system improves the state-of-the-art on the AMI headset mix, using no oracle information and under full evaluation (no collar and including overlapped speech). Finally, we show the strength of our system particularly on overlapped speech sections.
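The step "voice activity detection is applied on each estimated source" can be illustrated with a minimal sketch. It assumes the separation network has already produced one waveform per output source, and it swaps in a simple energy threshold for the voice activity detector; the paper's detector is not specified here and is presumably learned. The names `frame_energy`, `energy_vad`, `diarize_sources`, and the parameters `frame_len` and `threshold` are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def frame_energy(signal: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def energy_vad(signal: Sequence[float], frame_len: int = 4,
               threshold: float = 0.01) -> List[bool]:
    """Assumption: an energy threshold stands in for the real VAD."""
    return [e > threshold for e in frame_energy(signal, frame_len)]


def diarize_sources(sources: Sequence[Sequence[float]], frame_len: int = 4,
                    threshold: float = 0.01) -> List[List[int]]:
    """For each frame, list the indices of the sources that are active.

    A frame with more than one active source is overlapped speech.
    """
    activity = [energy_vad(s, frame_len, threshold) for s in sources]
    n_frames = min(len(a) for a in activity)
    return [[k for k, a in enumerate(activity) if a[t]]
            for t in range(n_frames)]


# Two toy "separated" sources that overlap in the middle frame.
source_a = [0.5] * 8 + [0.0] * 4
source_b = [0.0] * 4 + [0.5] * 8
print(diarize_sources([source_a, source_b]))  # -> [[0], [0, 1], [1]]
```

Because each source gets its own activity track, overlapped speech falls out naturally as frames where several tracks are active at once, which is the appeal of separation-guided diarization.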

Audio and Speech Processing, Machine Learning, Sound, Signal Processing

What problem does this paper attempt to address?

This paper addresses online speaker diarization of meeting recordings, and in particular the difficulty of handling overlapped speech. Speech separation has been proposed as a remedy, but separation models struggle on realistic data because they are trained on simulated mixtures with a fixed number of speakers. The paper proposes a separation-guided diarization scheme suited to the online processing of long meeting recordings with a variable number of speakers, such as those in the AMI corpus. The authors consider two separation networks, ConvTasNet and DPRNN, each configured with two or three output sources. Voice activity detection is applied to each estimated source to obtain the diarization output. The separation network is first adapted to real data using AMI, and the full model is then fine-tuned end-to-end. The system operates on short segments, and at inference time the local predictions are stitched together using speaker embeddings and incremental clustering. Although separation models struggle when the number of speakers differs between training and test conditions, the overall system is designed to handle a variable number of speakers. The reported results improve on the state of the art for online diarization on the AMI headset mix, using no oracle information and under full evaluation (no collar, overlapped speech included), with particular strength on overlapped speech sections. The paper also examines the contribution of each component and compares the separation models through comparative experiments.
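The stitching of local predictions through speaker embeddings and incremental clustering can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each local source is assumed to come with a fixed-size embedding, clusters are represented by the running sum of their members, and a cosine-similarity threshold decides whether to reuse a known speaker or open a new one. The class `IncrementalClusterer`, the function `cosine`, and the `threshold` value are all hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


class IncrementalClusterer:
    """Assigns each incoming embedding to a global speaker, one at a time."""

    def __init__(self, threshold: float = 0.7) -> None:
        self.threshold = threshold          # assumed value, for illustration
        self.sums: List[List[float]] = []   # running embedding sum per speaker

    def assign(self, embedding: Sequence[float]) -> int:
        # Find the most similar known speaker at or above the threshold.
        # Cosine against the running sum equals cosine against the mean.
        best, best_sim = -1, self.threshold
        for idx, total in enumerate(self.sums):
            sim = cosine(embedding, total)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best < 0:
            # No known speaker is close enough: open a new cluster.
            self.sums.append(list(embedding))
            return len(self.sums) - 1
        # Otherwise update the matched speaker's running sum.
        self.sums[best] = [a + b for a, b in zip(self.sums[best], embedding)]
        return best


clusterer = IncrementalClusterer(threshold=0.7)
stream = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
print([clusterer.assign(e) for e in stream])  # -> [0, 1, 0, 1]
```

Since speakers are added only when an embedding matches no existing cluster, the number of global speakers can grow over the course of a meeting even though each short segment has a fixed number of separation outputs.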

Multi-channel Conversational Speaker Separation via Neural Diarization

Unified Modeling of Multi-Talker Overlapped Speech Recognition and Diarization with a Sidecar Separator

EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

Integration of Speech Separation, Diarization, and Recognition for Multi-Speaker Meetings: System Description, Comparison, and Analysis

From Modular to End-to-End Speaker Diarization

Spatial-Temporal Activity-Informed Diarization and Separation

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

End-to-end Online Speaker Diarization with Target Speaker Tracking

A Real-time Speaker Diarization System Based on Spatial Spectrum

An Integrated Top-Down/Bottom-Up Approach To Speaker Diarization

A Deep Analysis of Speech Separation Guided Diarization Under Realistic Conditions

Simultaneous Diarization and Separation of Meetings through the Integration of Statistical Mixture Models

One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition

Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario

Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition

Incorporating Spatial Cues in Modular Speaker Diarization for Multi-channel Multi-party Meetings

TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings

Meeting Recognition with Continuous Speech Separation and Transcription-Supported Diarization

Spatial Diarization for Meeting Transcription with Ad-Hoc Acoustic Sensor Networks