Abstract: Speaker diarization is usually defined as the task of determining "who spoke when" in a recording. Until a few years ago, all competitive approaches were modular. Systems based on this framework reached state-of-the-art performance in most scenarios but had major difficulties dealing with overlapped speech. More recently, end-to-end models, which handle all aspects of speaker diarization with a single model and perform better on overlapped speech, have attracted considerable attention. This thesis was written during a period in which these two trends co-exist. We describe a system known as VBx, based on a Bayesian hidden Markov model used to cluster x-vectors (speaker embeddings obtained with a neural network), which has shown remarkable performance on different datasets and challenges. We comment on its advantages and limitations and evaluate results on different relevant corpora. Then, we move towards end-to-end neural diarization (EEND) methods. Because these models need large training sets and manually annotated diarization data is not available in sufficient quantities, the compromise solution is to generate training data artificially. We describe an approach for generating synthetic data that resembles real conversations in terms of speaker turns and overlaps. We show that training the popular EEND with encoder-decoder attractors (EEND-EDA) on these "simulated conversations" yields better performance than training on "simulated mixtures" created with a previously proposed method. We also propose a new EEND-based model, which we call DiaPer, and show that it can perform better than EEND-EDA, especially when dealing with many speakers and handling overlapped speech. Finally, we compare VBx-based and DiaPer systems on a wide variety of corpora and comment on the advantages of each technique.

End-to-End Neural Speaker Diarization with Absolute Speaker Loss

Towards Word-Level End-to-End Neural Speaker Diarization with Auxiliary Network

Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization

EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

End-to-end neural speaker diarization with an iterative adaptive attractor estimation

Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

Speech-Aware Neural Diarization with Encoder-Decoder Attractor Guided by Attention Constraints

EEND-DEMUX: End-to-End Neural Speaker Diarization via Demultiplexed Speaker Embeddings

From Modular to End-to-End Speaker Diarization

Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor

Neural Speaker Diarization with Speaker-Wise Chain Rule

End-to-end speaker segmentation for overlap-aware resegmentation

Improving End-to-End Neural Diarization Using Conversational Summary Representations

LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction

Quality-Aware End-to-End Audio-Visual Neural Speaker Diarization

Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR

TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization

End-to-End Feature Learning for Text-Independent Speaker Verification

Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition

Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances