A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition.
Feng Ma, Yanhui Tu, Maokui He, Ruoyu Wang, Shutong Niu, Lei Sun, Zhongfu Ye, Jun Du, Jia Pan, Chin-Hui Lee
DOI: https://doi.org/10.1109/ICASSP48485.2024.10446168
2024-01-01
Abstract: Deep learning (DL)-based speaker diarization methods have shown strong performance compared with traditional clustering-based methods for multi-talker speech diarization and recognition in far-field scenarios. However, most DL-based approaches cannot exploit spatial information well because they are not robust to unknown array topologies and acoustic scenarios. In this paper, a spatial long-term iterative mask estimation (SLT-IME) method is proposed to improve speaker diarization performance in various real-world acoustic scenarios. First, a complex angular central Gaussian mixture model (cACGMM), initialized with the diarization results, is used to estimate the presence probability of each speaker at each time-frequency bin, namely the speaker masks, over a long-term chunk. Then, the speaker masks are converted to speaker activities by thresholding, which provides the diarization information of which speaker is active and when. Finally, the estimated speaker activities can in turn serve as the initial input to the diarization system, resulting in improved ASR performance. Experimental results on the three CHiME-7 datasets (CHiME-6, DiPCo, and Mixer 6) show that the proposed method improves diarization and recognition performance simultaneously. It also plays a key role in the ensemble system that achieved the best performance in the main track of the CHiME-7 DASR Challenge.
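The mask-to-activity step described in the abstract can be illustrated with a minimal sketch. The abstract only states that masks are converted to activities according to a threshold; everything else here is an assumption made for illustration: averaging each speaker's mask over frequency, the threshold value of 0.5, the 16 ms frame shift, and the hypothetical helper names `masks_to_activity` and `activity_to_segments`. It is not the authors' implementation.

```python
import numpy as np

def masks_to_activity(masks, threshold=0.5):
    """Convert speaker masks to frame-level speaker activities.

    masks: array of shape (speakers, frames, freqs) with values in [0, 1].
    Assumption: a frame counts as active when the mask averaged over
    frequency exceeds `threshold`; the abstract does not specify this rule.
    """
    masks = np.asarray(masks, dtype=float)
    frame_presence = masks.mean(axis=-1)   # (speakers, frames)
    return frame_presence > threshold      # boolean activity matrix

def activity_to_segments(activity, frame_shift=0.016):
    """Turn one speaker's boolean activity into (start, end) times in seconds.

    frame_shift is an assumed hop size; the end time is exclusive.
    """
    padded = np.concatenate(([False], np.asarray(activity, dtype=bool), [False]))
    # Indices where activity switches on or off; they alternate start, end.
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[0::2], changes[1::2]
    return [(float(s * frame_shift), float(e * frame_shift))
            for s, e in zip(starts, ends)]

# Toy example: 2 speakers, 6 frames, 4 frequency bins, with one overlapped frame.
masks = np.zeros((2, 6, 4))
masks[0, 1:4, :] = 0.9   # speaker 0 dominates frames 1 to 3
masks[1, 3:6, :] = 0.8   # speaker 1 dominates frames 3 to 5
activity = masks_to_activity(masks)
segments = [activity_to_segments(a) for a in activity]
```

Under these assumptions, the resulting per-speaker segments are the "who spoke when" output that could be fed back as a new initialization for the next iteration.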