USED: Universal Speaker Extraction and Diarization

Junyi Ao, Mehmet Sinan Yıldırım, Ruijie Tao, Meng Ge, Shuai Wang, Yanmin Qian, Haizhou Li
2024-05-09
Abstract: Speaker extraction and diarization are two enabling techniques for real-world speech applications. Speaker extraction aims to extract a target speaker's voice from a speech mixture, while speaker diarization demarcates speech segments by speaker, annotating "who spoke when". Previous studies have typically treated the two tasks independently. In practical applications, it is more meaningful to have knowledge about "who spoke what and when", which is captured by the two tasks. The two tasks share a similar objective of disentangling speakers. Speaker extraction operates in the frequency domain, whereas diarization is in the temporal domain. It is logical to believe that speaker activities obtained from speaker diarization can benefit speaker extraction, while the extracted speech offers more accurate speaker activity detection than the speech mixture. In this paper, we propose a unified model called Universal Speaker Extraction and Diarization (USED) to address output inconsistency and scenario mismatch issues. It is designed to manage speech mixture with varying overlap ratios and variable number of speakers. We show that the USED model significantly outperforms the competitive baselines for speaker extraction and diarization tasks on LibriMix and SparseLibriMix datasets. We further validate the diarization performance on CALLHOME, a dataset based on real recordings, and experimental results indicate that our model surpasses recently proposed approaches.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses how to integrate two tasks, speaker extraction and speaker diarization, in multi-speaker scenarios while resolving output inconsistency and scenario mismatch between them. Specifically:

1. **Speaker extraction** aims to extract the voice of the target speaker from a speech mixture.
2. **Speaker diarization** labels when each speaker is speaking, that is, "who spoke when".

Traditional methods usually handle these two tasks separately, but in practical applications it is more meaningful to know "who spoke what and when", which requires both. In addition, the two tasks share a similar goal, disentangling speakers, so they can complement each other.

### Main problems

- **Output inconsistency**: The output forms of speaker extraction and speaker diarization differ. Speaker extraction generates the voice of a single speaker, while speaker diarization labels the activities of all speakers.
- **Scenario mismatch**: Speaker extraction is usually optimized for highly overlapping speech, while speaker diarization deals with sparsely overlapping speech, resulting in a scenario mismatch between training and evaluation.

### Solutions

To solve these problems, the authors propose a unified model named **Universal Speaker Extraction and Diarization (USED)**. The model has the following characteristics:

- **Universality**: It handles speech mixtures with a variable number of speakers and varying overlap ratios.
- **Embedding assignment module**: It keeps the number and order of outputs consistent, supports a variable number of speakers, and avoids output permutation problems.
- **Multi-task interaction module**: Through a scenario-aware differentiated loss, it lets the diarization output control whether the target speech is muted, keeping the diarization and extraction outputs consistent in time.
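The time consistency between the two outputs can be illustrated with a toy sketch. The summary says the model reaches this consistency through a loss applied during training; the hard post-hoc gate below is not the authors' method, only an assumed illustration of what "diarization controls whether the target speech is muted" means. The function name `gate_by_activity`, the frame length, the threshold, and the sample values are all hypothetical.

```python
from typing import List


def gate_by_activity(
    waveform: List[float],
    frame_activity: List[float],
    frame_len: int,
    threshold: float = 0.5,
) -> List[float]:
    """Zero out samples that fall in frames where the speaker is judged inactive.

    waveform:       extracted speech for one speaker (hypothetical values).
    frame_activity: per-frame speaker-activity probabilities from diarization.
    frame_len:      number of waveform samples covered by one frame (assumed).
    threshold:      probability at or above which a frame counts as active (assumed).
    """
    if not frame_activity or frame_len <= 0:
        raise ValueError("need at least one frame and a positive frame_len")

    gated = []
    for i, sample in enumerate(waveform):
        # Samples past the last frame reuse the last frame's decision.
        frame = min(i // frame_len, len(frame_activity) - 1)
        active = frame_activity[frame] >= threshold
        gated.append(sample if active else 0.0)
    return gated


# Hypothetical 8-sample extraction output and 4 diarization frames of 2 samples each.
extracted = [0.3, -0.2, 0.05, 0.04, 0.5, -0.4, 0.01, -0.02]
activity = [0.9, 0.1, 0.8, 0.2]

gated = gate_by_activity(extracted, activity, frame_len=2)
print(gated)  # [0.3, -0.2, 0.0, 0.0, 0.5, -0.4, 0.0, 0.0]
```

After gating, the extracted waveform is silent exactly where diarization marks the speaker as inactive, so the two outputs agree on when the speaker talks. A learned model would enforce this softly during training instead of through a hard threshold.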
### Experimental results

The experimental results show that the USED model outperforms the baseline systems for both speaker extraction and diarization on the LibriMix and SparseLibriMix datasets. On CALLHOME, a dataset based on real recordings, its diarization performance surpasses recently proposed approaches.

In conclusion, the paper proposes a method that handles speaker extraction and speaker diarization jointly, addresses the output inconsistency and scenario mismatch of previous methods, and thereby improves performance in practical application scenarios.