Interrelate Training and Clustering for Online Speaker Diarization
Yifan Chen, Gaofeng Cheng, Runyan Yang, Pengyuan Zhang, Yonghong Yan
DOI: https://doi.org/10.1109/taslp.2024.3357033
2024-01-01
Abstract: In clustering-based speaker diarization systems, the embedding clusters of distinct speakers vary widely in size and density, which makes accurate clustering difficult. Nevertheless, with the help of the overall distance relationships among speaker embeddings, sophisticated offline clustering algorithms can group most embeddings into the correct cluster. In online scenarios, however, such complete distance relationships cannot be obtained because embeddings arrive incrementally. Consequently, determining the number of clusters and then correctly grouping the embeddings becomes challenging in an online fashion. Furthermore, errors accumulate quickly over time if the online clustering algorithm assigns embeddings to the wrong clusters at the beginning. To address these problems, we design a novel framework for online clustering. To reduce the high variability of speaker embeddings, we propose the clustering guided embedding extractor training (CGEET) algorithm, which encourages the embedding spaces of different speakers to be similar in size in an attempt to simplify the distance relationships among embeddings. The CGEET algorithm can capture the distance information of the entire speaker embedding space and provide it to the online clustering algorithm. With this preliminary information, the distance thresholds guided online clustering (DTGOC) algorithm then processes incoming embeddings using a divide-and-conquer approach. It first handles the embeddings with explicit distance relationships and then searches for the possible path combinations they have with the remaining embeddings in an online fashion. Moreover, to exploit the distance relationships between embeddings that are far apart in time, an online re-clustering strategy is incorporated into the DTGOC algorithm, which alleviates error accumulation during online clustering.
With these innovations, our proposed online clustering system achieves a DER of 14.00 with a collar of 0.25 at 2.5 s latency on AISHELL-4, while the DER of the offline agglomerative hierarchical clustering system is 14.54.
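To make the online setting concrete, the following is a minimal sketch of a generic single-threshold online clustering baseline. It is not the DTGOC algorithm described in the abstract, whose details are not given here. The class name `OnlineThresholdClusterer`, the threshold value of 0.5, the use of cosine distance, and the toy two-dimensional embeddings are all assumptions chosen for illustration. The sketch shows why early assignments matter: each incoming embedding is committed to a cluster immediately, using only the centroids seen so far.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class OnlineThresholdClusterer:
    """Hypothetical baseline: nearest-centroid assignment with one threshold."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed value, not taken from the paper
        self.centroids = []         # running mean embedding per cluster
        self.counts = []            # number of embeddings per cluster

    def add(self, embedding):
        """Assign one incoming embedding and return its cluster index."""
        if self.centroids:
            dists = [cosine_distance(embedding, c) for c in self.centroids]
            best = min(range(len(dists)), key=dists.__getitem__)
            if dists[best] <= self.threshold:
                # Close enough: update the running mean of the chosen cluster.
                n = self.counts[best]
                self.centroids[best] = [
                    (c * n + x) / (n + 1)
                    for c, x in zip(self.centroids[best], embedding)
                ]
                self.counts[best] = n + 1
                return best
        # No cluster yet, or none within the threshold: open a new cluster.
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1


# Toy stream of 2-D "embeddings" arriving one at a time (made-up data).
stream = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (1.0, 0.05)]
clusterer = OnlineThresholdClusterer(threshold=0.5)
labels = [clusterer.add(e) for e in stream]
```

In this baseline a wrong early assignment shifts a centroid permanently, which is the kind of error accumulation that the abstract's online re-clustering strategy is meant to alleviate.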
engineering, electrical & electronic; acoustics