Online speaker diarization of meetings guided by speech separation

Elio Gruttadauria, Mathieu Fontaine, Slim Essid
2024-01-30
Abstract: Overlapped speech is notoriously problematic for speaker diarization systems. Consequently, the use of speech separation has recently been proposed to improve their performance. Although promising, speech separation models struggle with realistic data because they are trained on simulated mixtures with a fixed number of speakers. In this work, we introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings with a variable number of speakers, as present in the AMI corpus. We envisage ConvTasNet and DPRNN as alternatives for the separation networks, with two or three output sources. To obtain the speaker diarization result, voice activity detection is applied on each estimated source. The final model is fine-tuned end-to-end, after first adapting the separation to real data using AMI. The system operates on short segments, and inference is performed by stitching the local predictions using speaker embeddings and incremental clustering. The results show that our system improves the state-of-the-art on the AMI headset mix, using no oracle information and under full evaluation (no collar and including overlapped speech). Finally, we show the strength of our system particularly on overlapped speech sections.
Audio and Speech Processing, Machine Learning, Sound, Signal Processing
What problem does this paper attempt to address?
This paper addresses online speaker diarization of meeting recordings, and in particular the difficulty of handling overlapped speech. Speech separation has been proposed as a remedy, but separation models struggle on real data because they are trained on simulated mixtures with a fixed number of speakers. The paper proposes a speech separation-guided diarization scheme suited to the online processing of long meeting recordings with a variable number of speakers, such as those in the AMI corpus. The authors consider two separation networks, ConvTasNet and DPRNN, each configured with two or three output sources. Voice activity detection is applied to each estimated source to obtain the diarization output, and the full model is fine-tuned end-to-end after the separation network has first been adapted to real data using AMI. The system operates on short segments; at inference time, the local predictions are stitched together using speaker embeddings and incremental clustering. The reported results show an improvement over the state of the art on the AMI headset mix under full evaluation (no collar, overlapped speech included) and without oracle information, with the system's strength most apparent on overlapped speech sections. Although each separation model has a fixed number of output sources, the overall system is designed to handle a variable number of speakers across a recording. The paper also examines the contribution of each component and compares the separation models experimentally.
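The stitching step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `IncrementalClusterer`, the cosine-similarity threshold of 0.7, the running-mean centroid update, and the greedy matching are all assumptions chosen for illustration. It takes the speaker embeddings of the active sources in each short segment and maps them to global speaker identities as segments arrive, creating a new identity whenever no existing centroid is similar enough.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class IncrementalClusterer:
    """Hypothetical online clusterer: one running-mean centroid per global speaker."""

    def __init__(self, threshold=0.7):
        self.threshold = threshold  # assumed similarity cut-off, not taken from the paper
        self.centroids = []         # one centroid vector per global speaker
        self.counts = []            # number of embeddings merged into each centroid

    def assign(self, embeddings):
        """Map the embeddings of one segment's active sources to global speaker ids.

        Assumption: each separated source in a segment is a different speaker,
        so a global id is used at most once per segment. Matching is greedy,
        in the order the embeddings are given.
        """
        used = set()
        labels = []
        for emb in embeddings:
            best, best_sim = None, self.threshold
            for k, centroid in enumerate(self.centroids):
                if k in used:
                    continue
                sim = cosine(emb, centroid)
                if sim >= best_sim:
                    best, best_sim = k, sim
            if best is None:
                # No centroid is similar enough: open a new global speaker.
                self.centroids.append(list(emb))
                self.counts.append(1)
                best = len(self.centroids) - 1
            else:
                # Fold the embedding into the matched centroid (running mean).
                n = self.counts[best]
                self.centroids[best] = [
                    (c * n + e) / (n + 1) for c, e in zip(self.centroids[best], emb)
                ]
                self.counts[best] = n + 1
            used.add(best)
            labels.append(best)
        return labels


# Toy 3-dimensional embeddings standing in for real speaker embeddings.
clusterer = IncrementalClusterer(threshold=0.7)
print(clusterer.assign([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]))  # [0, 1]
print(clusterer.assign([[0.1, 0.9, 0.0], [0.9, 0.1, 0.0]]))  # [1, 0]
print(clusterer.assign([[0.0, 0.0, 1.0]]))                   # [2]
```

In the second segment the two sources come out in swapped order, and the clusterer still maps them to the same global speakers; the third segment introduces a speaker not seen before. This is how a separator with a fixed number of outputs per segment can, in principle, track a variable number of speakers over a long recording.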