A Review of Common Online Speaker Diarization Methods

Roman Aperdannier, Sigurd Schacht, Alexander Piazza
2024-06-21
Abstract: Speaker diarization provides the answer to the question "who spoke when?" for an audio file. This information can be used to complete audio transcripts for further processing steps. Most speaker diarization systems assume that the audio file is available as a whole. However, there are scenarios in which the speaker labels are needed immediately after the arrival of an audio segment. Speaker diarization with a correspondingly low latency is referred to as online speaker diarization. This paper provides an overview. First the history of online speaker diarization is briefly presented. Next a taxonomy and datasets for training and evaluation are given. In the sections that follow, online diarization methods and systems are discussed in detail. This paper concludes with the presentation of challenges that still need to be solved by future research in the field of online speaker diarization.
Sound, Computation and Language, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the key challenges of online speaker diarization. Specifically, it reviews common online speaker diarization methods, filling a gap in the existing literature, which lacks a dedicated discussion of online diarization methods and systems.

### Problem Background

Speaker diarization is a machine-learning task that assigns audio segments to the corresponding speakers, thus answering the question "who spoke when?". It is an important step in producing complete audio transcripts, especially when combined with Automatic Speech Recognition (ASR), and is used in scenarios such as online meetings, in-person meetings, corporate earnings calls, court records, interviews, and social media audio or video.

In some applications, however, speaker labels are needed immediately after an audio segment arrives, so processing must run with low latency. During a corporate earnings call, for example, decisions on whether to sell stocks are made based on what is being said. Speaker diarization under this low-latency constraint is called online speaker diarization.

### Main Contributions of the Paper

1. **Review of Historical Development**: Briefly presents the history of online speaker diarization.
2. **Classification and Datasets**: Provides a taxonomy and an overview of datasets for training and evaluation.
3. **Methods and Systems**: Discusses online speaker diarization methods and systems in detail.
4. **Future Challenges**: Identifies the challenges that future research in this field still needs to address.
### Key Formulas

- **Diarization Error Rate (DER)**:

  \[ \text{DER}=\frac{\text{FA}+\text{MS}+\text{SC}}{\text{Total Duration}} \]

  where:
  - FA: False Alarm (duration of non-speech labeled as speech)
  - MS: Missed Speech (duration of speech that was not detected)
  - SC: Speaker Confusion (duration of speech assigned to the wrong speaker)

- **Jaccard Error Rate (JER)**:

  \[ \text{JER}=\frac{1}{N_{\text{ref}}}\sum_{i}\frac{\text{FA}_{i}+\text{MS}_{i}}{\text{TOTAL}_{i}} \]

  where:
  - \(N_{\text{ref}}\): number of reference speakers
  - \(\text{FA}_{i}\), \(\text{MS}_{i}\): false alarm and missed speech durations for reference speaker \(i\)
  - \(\text{TOTAL}_{i}\): duration of the union of the speech of reference speaker \(i\) and of the system speaker matched to it

### Main Methods and Technologies

1. **GMM-based Methods**: Early online speaker diarization systems mainly used Gaussian Mixture Models (GMMs) to represent speakers, with iterative soft clustering.
2. **i-vector Methods**: i-vectors were introduced to handle the way GMM-based models vary for the same speaker across different recordings. They build on Joint Factor Analysis (JFA), which decomposes the GMM supervector into multiple components.
3. **Supervised Online Clustering (UIS-RNN)**: Uses a fully supervised clustering component that combines d-vectors with an Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN).
4. **Turn-to-Diarize**: Simplifies the annotation process by inserting `<st>` tags into ASR transcripts, uses an LSTM to compute d-vectors, and applies spectral clustering for online processing.
5. **End-to-End Systems**: Systems such as FS-EEND handle all sub-tasks with a single neural network and use self-attention mechanisms to improve performance.

### Summary

Through a comprehensive review of online speaker diarization methods, this paper fills a gap in the existing literature and provides directions and a basis for future research.
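
The two error rates under Key Formulas can be computed directly once the error durations are known. The following is a minimal Python sketch under that assumption: it takes durations in seconds as given and does not perform the alignment between reference and system output that a real scoring tool would. The names `ErrorDurations`, `diarization_error_rate`, and `jaccard_error_rate` are hypothetical and chosen only for this illustration, and the numbers in the usage section are made up.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ErrorDurations:
    """Error durations in seconds (hypothetical container for this sketch)."""
    false_alarm: float        # FA
    missed_speech: float      # MS
    speaker_confusion: float  # SC


def diarization_error_rate(errors: ErrorDurations, total_duration: float) -> float:
    """DER = (FA + MS + SC) / total duration."""
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")
    error_time = (
        errors.false_alarm + errors.missed_speech + errors.speaker_confusion
    )
    return error_time / total_duration


def jaccard_error_rate(
    per_speaker: Sequence[Tuple[float, float, float]],
) -> float:
    """JER = mean over reference speakers of (FA_i + MS_i) / TOTAL_i.

    Each entry of per_speaker is (FA_i, MS_i, TOTAL_i) in seconds.
    """
    if not per_speaker:
        raise ValueError("at least one reference speaker is required")
    ratios = [(fa + ms) / total for fa, ms, total in per_speaker]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    der = diarization_error_rate(ErrorDurations(2.0, 3.0, 5.0), 100.0)
    jer = jaccard_error_rate([(1.0, 2.0, 30.0), (0.5, 1.5, 20.0)])
    print(f"DER = {der:.2%}, JER = {jer:.2%}")
```

Unlike DER, which pools all errors over the whole recording, JER averages a per-speaker ratio, so a speaker with little speech time weighs as much as a dominant one.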