Abstract: Deep neural network-based systems have significantly improved the performance of speaker diarization. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers, while target speaker voice activity detection (TS-VAD) systems tend to be overly complex. In this paper, we propose a simple attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND). During training, we introduce a teacher-forcing strategy to address the speaker permutation problem, which leads to faster model convergence. For evaluation, we propose an iterative decoding method that outputs the diarization result for each speaker sequentially. Additionally, we propose an Enhancer module that enhances the frame-level speaker embeddings, enabling the model to handle scenarios with an unseen number of speakers. We also explore replacing the transformer encoder with a Conformer architecture, which better models local information. Furthermore, we found that commonly used simulation datasets for speaker diarization have a much higher overlap ratio than real data, and that training on simulated data that is more consistent with real data yields an improvement. Extensive experiments demonstrate the effectiveness of the proposed methods. Our best system achieves a new state-of-the-art diarization error rate (DER) on the CALLHOME ($\mathbf{10.08\%}$), DIHARD II ($\mathbf{24.64\%}$), and AMI ($\mathbf{13.00\%}$) evaluation benchmarks when overlap is considered and no oracle voice activity detection (VAD) is used. Beyond speaker diarization, our AED-EEND system is also highly competitive as a speech type detection model.
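To make the overlap-ratio comparison concrete, the following is a minimal sketch of how such a ratio could be measured from frame-level speaker activity labels. The definition used here (frames with two or more active speakers divided by frames with at least one active speaker) is an assumption for illustration, since the abstract does not state the exact definition; the function name `overlap_ratio` and the toy labels are likewise hypothetical.

```python
def overlap_ratio(activity):
    """Fraction of speech frames in which two or more speakers are active.

    `activity` is a sequence of frames; each frame is a sequence of 0/1
    flags, one per speaker (1 = that speaker is active in the frame).
    This definition is an assumption, not taken from the paper.
    """
    speech_frames = 0
    overlap_frames = 0
    for frame in activity:
        n_active = sum(1 for flag in frame if flag)
        if n_active >= 1:
            speech_frames += 1
        if n_active >= 2:
            overlap_frames += 1
    # A recording with no speech at all has no overlap by convention here.
    if speech_frames == 0:
        return 0.0
    return overlap_frames / speech_frames


# Toy labels for two speakers over five frames (illustrative only).
example = [
    [1, 0],  # speaker 1 only
    [1, 1],  # both speakers: overlap
    [0, 1],  # speaker 2 only
    [0, 0],  # silence
    [1, 1],  # both speakers: overlap
]
ratio = overlap_ratio(example)
```

Computing this statistic on both a simulated corpus and a real one would be one way to check how closely the simulation matches real conversations before training.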
