Abstract: Deep neural network-based systems have significantly improved the performance of speaker diarization tasks. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers, while target speaker voice activity detection (TS-VAD) systems tend to be overly complex. In this paper, we propose a simple attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND). During training, we introduce a teacher-forcing strategy to address the speaker permutation problem, leading to faster model convergence. For evaluation, we propose an iterative decoding method that outputs diarization results for each speaker sequentially. Additionally, we propose an Enhancer module to enhance the frame-level speaker embeddings, enabling the model to handle scenarios with an unseen number of speakers. We also explore replacing the transformer encoder with a Conformer architecture, which better models local information. Furthermore, we find that commonly used simulation datasets for speaker diarization have a much higher overlap ratio than real data, and that simulated training data that is more consistent with real data improves consistency. Extensive experimental validation demonstrates the effectiveness of the proposed methods. Our best system achieves a new state-of-the-art diarization error rate (DER) on the CALLHOME ($\mathbf {10.08\%}$), DIHARD II ($\mathbf {24.64\%}$), and AMI ($\mathbf {13.00\%}$) evaluation benchmarks when overlap is considered and no oracle voice activity detection (VAD) is used. Beyond speaker diarization, our AED-EEND system is also highly competitive as a speech type detection model.
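The overlap-ratio comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes frame-level binary speaker-activity labels and defines the overlap ratio as the fraction of speech frames in which two or more speakers are active; the abstract does not state the exact definition used, so both this definition and the function name `overlap_ratio` are assumptions made for illustration.

```python
from typing import Sequence


def overlap_ratio(labels: Sequence[Sequence[int]]) -> float:
    """Fraction of speech frames with two or more active speakers.

    `labels` is a frames x speakers matrix of 0/1 activity flags.
    Assumed definition: overlapped frames / frames with any speech.
    """
    speech_frames = 0
    overlap_frames = 0
    for frame in labels:
        n_active = sum(1 for flag in frame if flag)
        if n_active >= 1:
            speech_frames += 1
        if n_active >= 2:
            overlap_frames += 1
    # A recording with no speech at all has no overlap by convention here.
    return overlap_frames / speech_frames if speech_frames else 0.0


# Toy two-speaker recording: five frames, one silent, two overlapped.
example = [
    [1, 0],
    [1, 1],
    [0, 1],
    [0, 0],
    [1, 1],
]
print(overlap_ratio(example))  # 2 overlapped frames out of 4 speech frames
```

Computing this statistic on both a simulated set and a real set is one simple way to check how closely the two match in overlap.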

A neural prosody encoder for end-to-end dialogue act classification

Attention-based Encoder-Decoder Network for End-to-End Neural Speaker Diarization with Target Speaker Attractor

Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

Speech-Aware Neural Diarization with Encoder-Decoder Attractor Guided by Attention Constraints

A neural speech decoding framework leveraging deep learning and speech synthesis

Aalto's End-to-End DNN systems for the INTERSPEECH 2020 Computational Paralinguistics Challenge

Improving End-to-End SLU performance with Prosodic Attention and Distillation

Improving End-to-End Neural Diarization Using Conversational Summary Representations

An Empirical Study on End-to-End Singing Voice Synthesis with Encoder-Decoder Architectures

Towards Word-Level End-to-End Neural Speaker Diarization with Auxiliary Network

EDA: Enriching Emotional Dialogue Acts using an Ensemble of Neural Annotators

NeuSpeech: Decode Neural signal as Speech

Deliberation Model Based Two-Pass End-to-End Speech Recognition

Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models

End-to-end translation of human neural activity to speech with a dual-dual generative adversarial network

Conversational End-to-End TTS for Voice Agent

CochleaNet: A Robust Language-independent Audio-Visual Model for Speech Enhancement

Dialogue Act Recognition Via CRF-Attentive Structured Network

Improved Dynamic Memory Network for Dialogue Act Classification with Adversarial Training

Modeling the Acoustic Correlates of Dialog Act for Expressive Chinese TTS Synthesis