Abstract: Speaker diarization provides the answer to the question "who spoke when?" for an audio file. This information can be used to complete audio transcripts for further processing steps. Most speaker diarization systems assume that the audio file is available as a whole. However, there are scenarios in which the speaker labels are needed immediately after the arrival of an audio segment. Speaker diarization with a correspondingly low latency is referred to as online speaker diarization. This paper provides an overview. First the history of online speaker diarization is briefly presented. Next a taxonomy and datasets for training and evaluation are given. In the sections that follow, online diarization methods and systems are discussed in detail. This paper concludes with the presentation of challenges that still need to be solved by future research in the field of online speaker diarization.

What problem does this paper attempt to address?

This paper addresses the key challenges of online speaker diarization. Specifically, it reviews common online speaker diarization methods, filling a gap in the existing literature, which lacks a dedicated discussion of online speaker diarization methods and systems.

### Problem Background

Speaker diarization is a machine-learning task that assigns audio segments to the corresponding speakers, thus answering the question "who spoke when?". It is important for generating complete audio transcripts, especially in combination with Automatic Speech Recognition (ASR), and is widely used in scenarios such as online meetings, meeting conversations, corporate earnings calls, court records, interviews, and social media audio and video.

In some applications, however, speaker labels are needed immediately after an audio segment arrives, which requires low-latency processing. For example, during a corporate earnings call, decisions on whether to sell stock are made based on what is said. Speaker diarization with this low latency is called online speaker diarization.

### Main Contributions of the Paper

1. **Review of Historical Development**: Briefly presents the history of online speaker diarization.
2. **Classification and Datasets**: Provides a taxonomy and the datasets used for training and evaluation.
3. **Methods and Systems**: Discusses online speaker diarization methods and systems in detail.
4. **Future Challenges**: Points out the challenges that future research in this field still needs to address.
### Key Formulas

- **Diarization Error Rate (DER)**:

  \[ \text{DER}=\frac{\text{FA}+\text{MS}+\text{SC}}{\text{Total Duration of Time}} \]

  where:
  - FA: False Alarm
  - MS: Missed Speech
  - SC: Speaker Confusion

- **Jaccard Error Rate (JER)**:

  \[ \text{JER}=\frac{1}{N_{\text{ref}}}\sum_{i}\frac{\text{FA}_{i}+\text{MS}_{i}}{\text{TOTAL}_{i}} \]

### Main Methods and Technologies

1. **GMM-based Methods**: Early online speaker diarization systems mainly used Gaussian Mixture Models (GMMs) to represent speakers through iterative soft clustering.
2. **i-vector Methods**: To handle the variation in a speaker's voice across different recordings, which GMMs capture poorly, i-vectors were introduced, with the GMM supervector decomposed into multiple components through Joint Factor Analysis (JFA).
3. **Supervised Online Clustering (UIS-RNN)**: Uses a fully supervised clustering component that combines d-vectors with an Unbounded Interleaved-State Recurrent Neural Network (UIS-RNN).
4. **Turn-to-Diarize**: Simplifies the annotation process by inserting <st> tags in ASR transcripts, uses an LSTM to compute d-vectors, and applies spectral clustering for online processing.
5. **End-to-End Systems**: Systems such as FS-EEND handle all sub-tasks with a single neural network and use self-attention mechanisms to improve performance.

### Summary

Through a comprehensive review of online speaker diarization methods, this paper fills a gap in the existing literature and provides directions and a basis for future research.
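To make the DER and JER formulas above concrete, the following is a minimal frame-level sketch, not the scoring procedure used in the paper. It rests on several assumptions that the summary does not state: audio is split into equal-length frames, at most one speaker is active per frame (no overlap), hypothesis labels have already been mapped to reference labels, the DER denominator is the total reference speech duration (the usual convention), and `TOTAL_i` in JER is the union of the frames assigned to speaker `i` in the reference or the hypothesis. The function names are hypothetical.

```python
from typing import Optional, Sequence

Label = Optional[str]  # speaker name, or None for a non-speech frame


def diarization_error_rate(reference: Sequence[Label],
                           hypothesis: Sequence[Label]) -> float:
    """Frame-level DER = (FA + MS + SC) / total reference speech frames."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")
    fa = ms = sc = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
        if ref is None and hyp is not None:
            fa += 1   # false alarm: speech predicted where there is none
        elif ref is not None and hyp is None:
            ms += 1   # missed speech: reference speech not detected
        elif ref is not None and hyp is not None and ref != hyp:
            sc += 1   # speaker confusion: speech assigned to wrong speaker
    if speech == 0:
        raise ValueError("reference contains no speech frames")
    return (fa + ms + sc) / speech


def jaccard_error_rate(reference: Sequence[Label],
                       hypothesis: Sequence[Label]) -> float:
    """Frame-level JER averaged over the reference speakers."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")
    speakers = {ref for ref in reference if ref is not None}
    if not speakers:
        raise ValueError("reference contains no speakers")
    pairs = list(zip(reference, hypothesis))
    total_error = 0.0
    for spk in speakers:
        fa_i = sum(1 for ref, hyp in pairs if hyp == spk and ref != spk)
        ms_i = sum(1 for ref, hyp in pairs if ref == spk and hyp != spk)
        total_i = sum(1 for ref, hyp in pairs if ref == spk or hyp == spk)
        total_error += (fa_i + ms_i) / total_i
    return total_error / len(speakers)


# Toy example: ten frames, two reference speakers.
reference = ["A", "A", "A", None, "B", "B", "B", "B", None, None]
hypothesis = ["A", "A", "B", None, "B", "B", None, "B", "A", None]

der = diarization_error_rate(reference, hypothesis)
jer = jaccard_error_rate(reference, hypothesis)
print(f"DER = {der:.3f}, JER = {jer:.3f}")
```

In the toy example there is one false-alarm frame, one missed frame, and one confused frame against seven reference speech frames, so DER is 3/7. JER weights each reference speaker equally regardless of how long they speak, which is why the two metrics can diverge when speaking time is unbalanced.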

A Review of Common Online Speaker Diarization Methods

Systematic Evaluation of Online Speaker Diarization Systems Regarding their Latency

An Experimental Review of Speaker Diarization methods with application to Two-Speaker Conversational Telephone Speech recordings

Online Speaker Diarization with Core Samples Selection

An approach to optimize inference of the DIART speaker diarization pipeline

An Integrated Top-Down/Bottom-Up Approach To Speaker Diarization

Sequence-to-Sequence Neural Diarization with Automatic Speaker Detection and Representation

From Modular to End-to-End Speaker Diarization

Speaker Diarization with Lexical Information

A lightweight approach to real-time speaker diarization: from audio toward audio-visual data streams

Online speaker diarization of meetings guided by speech separation

One model to rule them all ? Towards End-to-End Joint Speaker Diarization and Speech Recognition

Investigating Various Diarization Algorithms for Speaker in the Wild (SITW) Speaker Recognition Challenge

DiarizationLM: Speaker Diarization Post-Processing with Large Language Models

A Real-time Speaker Diarization System Based on Spatial Spectrum

Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR

Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization

A Quick and Effective Speaker Diarization System.

Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment

End-to-end Online Speaker Diarization with Target Speaker Tracking

Speaker Diarization and Identification from Single-Channel Classroom Audio Recording Using Virtual Microphones