Abstract: Deep neural network-based systems have significantly improved the performance of speaker diarization. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers, while target-speaker voice activity detection (TS-VAD) systems tend to be overly complex. In this paper, we propose a simple attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND). During training, we introduce a teacher-forcing strategy to address the speaker permutation problem, which leads to faster model convergence. For evaluation, we propose an iterative decoding method that outputs the diarization result for each speaker sequentially. Additionally, we propose an Enhancer module that enhances the frame-level speaker embeddings, enabling the model to handle scenarios with an unseen number of speakers. We also explore replacing the Transformer encoder with a Conformer architecture, which better models local information. Furthermore, we find that commonly used simulated datasets for speaker diarization have a much higher overlap ratio than real data, and that training on simulated data that is more consistent with real data can improve consistency. Extensive experiments demonstrate the effectiveness of the proposed methods. Our best system achieved new state-of-the-art diarization error rate (DER) performance on the CALLHOME ($\mathbf{10.08\%}$), DIHARD II ($\mathbf{24.64\%}$), and AMI ($\mathbf{13.00\%}$) evaluation benchmarks when overlap is considered and no oracle voice activity detection (VAD) is used. Beyond speaker diarization, our AED-EEND system is also highly competitive as a speech type detection model.
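The overlap ratio mentioned in the abstract can be illustrated with a small sketch. The abstract does not define the quantity, so the definition used here (overlapped frames divided by speech frames) is an assumption, as are the function name `overlap_ratio` and the frame-level 0/1 label layout.

```python
def overlap_ratio(activity):
    """Fraction of speech frames in which two or more speakers are active.

    `activity` is a list of frames; each frame is a list of 0/1 flags,
    one per speaker (a hypothetical label layout, not taken from the paper).
    """
    speech_frames = 0
    overlap_frames = 0
    for frame in activity:
        active = sum(frame)
        if active >= 1:
            speech_frames += 1   # at least one speaker is talking
        if active >= 2:
            overlap_frames += 1  # two or more speakers talk at once
    # Assumed convention: a recording with no speech has ratio 0.0.
    return overlap_frames / speech_frames if speech_frames else 0.0


# Toy two-speaker recording with five frames.
frames = [
    [0, 0],  # silence
    [1, 0],  # speaker 1 only
    [1, 1],  # overlap
    [0, 1],  # speaker 2 only
    [1, 1],  # overlap
]
ratio = overlap_ratio(frames)
print(f"overlap ratio: {ratio:.2f}")
```

Computing this statistic on both simulated and real recordings is one way to check how far a simulated corpus deviates from the target data.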

Speaker Diarization with Enhancing Speech for the First DIHARD Challenge

An Analysis of Speaker Diarization Fusion Methods for the First DIHARD Challenge

Scenario-Dependent Speaker Diarization for DIHARD-III Challenge

Progressive Multi-Target Network Based Speech Enhancement with Snr-Preselection for Robust Speaker Diarization

A Novel LSTM-Based Speech Preprocessor for Speaker Diarization in Realistic Mismatch Conditions

Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

A Deep Analysis of Speech Separation Guided Diarization Under Realistic Conditions

TalTech-IRIT-LIS Speaker and Language Diarization Systems for DISPLACE 2024

Exploring Speaker-Related Information in Spoken Language Understanding for Better Speaker Diarization

The RoyalFlush Automatic Speech Diarization and Recognition System for In-Car Multi-Channel Automatic Speech Recognition Challenge

Target-speaker Voice Activity Detection with Improved I-Vector Estimation for Unknown Number of Speaker

DCF-DS: Deep Cascade Fusion of Diarization and Separation for Speech Recognition under Realistic Single-Channel Conditions

The FlySpeech Audio-Visual Speaker Diarization System for MISP Challenge 2022

A Real-time Speaker Diarization System Based on Spatial Spectrum

One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition

Royalflush Speaker Diarization System for ICASSP 2022 Multi-channel Multi-party Meeting Transcription Challenge

Investigating Various Diarization Algorithms for Speaker in the Wild (SITW) Speaker Recognition Challenge

Exploring Detection-based Method For Speaker Diarization @ Ego4D Audio-only Diarization Challenge 2022

Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization