Abstract: Speaker diarization is usually described as the task of determining "who spoke when" in a recording. Until a few years ago, all competitive approaches were modular. Systems based on this framework reached state-of-the-art performance in most scenarios but had major difficulties dealing with overlapped speech. More recently, end-to-end models, which handle all aspects of speaker diarization with a single model and cope better with overlapped speech, have attracted considerable attention. This thesis was written during a period in which these two trends coexist. We describe VBx, a system that uses a Bayesian hidden Markov model to cluster x-vectors (speaker embeddings obtained with a neural network) and that has shown remarkable performance on different datasets and challenges. We comment on its advantages and limitations and evaluate results on several relevant corpora. Then, we move towards end-to-end neural diarization (EEND) methods. Because these models require large training sets and manually annotated diarization data is not available in sufficient quantities, the compromise solution is to generate training data artificially. We describe an approach for generating synthetic data that resembles real conversations in terms of speaker turns and overlaps. We show that this method, which generates "simulated conversations", yields better performance than a previously proposed method for creating "simulated mixtures" when training the popular EEND with encoder-decoder attractors (EEND-EDA). We also propose a new EEND-based model, which we call DiaPer, and show that it can perform better than EEND-EDA, especially when dealing with many speakers and when handling overlapped speech. Finally, we compare the VBx-based and DiaPer systems on a wide variety of corpora and comment on the advantages of each technique.

From Modular to End-to-End Speaker Diarization