Abstract: Introduction: Speaker diarization is an essential preprocessing step for diagnosing cognitive impairments from speech-based Montreal Cognitive Assessments (MoCA). Methods: This paper proposes three enhancements to conventional speaker diarization methods for such assessments. The enhancements tackle the challenges of diarizing MoCA recordings on two fronts. First, a multi-scale channel-interdependence speaker embedding is used as the front-end speaker representation to overcome the acoustic mismatch caused by far-field microphones. Specifically, a squeeze-and-excitation (SE) unit and channel-dependent attention are added to Res2Net blocks for multi-scale feature aggregation. Second, a sequence comparison approach that takes a holistic view of the whole conversation is applied to measure the similarity of short speech segments in the conversation, which results in a speaker-turn aware scoring matrix for the subsequent clustering step. Third, to further enhance diarization performance, we propose incorporating a pairwise similarity measure so that the speaker-turn aware scoring matrix contains both local and global information across the segments. Results: Evaluations on an interactive MoCA dataset show that the proposed enhancements lead to a diarization system that outperforms conventional x-vector/PLDA systems under language-, age-, and microphone-mismatch scenarios. Discussion: The results also show that the proposed enhancements can help hypothesize the speaker-turn timestamps, making the diarization method amenable to datasets without timestamp information.
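The squeeze-and-excitation (SE) unit mentioned in the Methods can be illustrated with a minimal sketch of the generic SE operation: average each channel over time, pass the result through a small bottleneck, and use the sigmoid output to rescale the channels. This is not the paper's model. The function name `squeeze_excitation`, the random weights, the omission of bias terms, and the sizes (8 channels, 50 frames, reduction factor 4) are all assumptions made for illustration, and the Res2Net blocks and channel-dependent attention are not shown.

```python
import math
import random


def squeeze_excitation(features, w_reduce, w_expand):
    """Rescale each channel of a channels-by-frames feature map.

    Hypothetical helper: weights are passed in explicitly and biases are
    omitted to keep the sketch short.
    """
    # Squeeze: average each channel over time to get one value per channel.
    squeezed = [sum(channel) / len(channel) for channel in features]
    # Excitation, step 1: bottleneck projection followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(row, squeezed)))
        for row in w_reduce
    ]
    # Excitation, step 2: expand back to one gate per channel via a sigmoid.
    gate = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Scale: multiply every frame of a channel by that channel's gate.
    rescaled = [[g * x for x in channel] for g, channel in zip(gate, features)]
    return rescaled, gate


# Illustrative sizes and random weights (assumptions, not values from the paper).
rng = random.Random(0)
channels, frames, reduction = 8, 50, 4
features = [[rng.gauss(0.0, 1.0) for _ in range(frames)] for _ in range(channels)]
w_reduce = [[rng.gauss(0.0, 0.1) for _ in range(channels)]
            for _ in range(channels // reduction)]
w_expand = [[rng.gauss(0.0, 0.1) for _ in range(channels // reduction)]
            for _ in range(channels)]

rescaled, gate = squeeze_excitation(features, w_reduce, w_expand)
```

Each entry of `gate` lies strictly between 0 and 1, so the unit can only attenuate channels relative to one another; in a trained network the weights would be learned so that informative channels are kept closer to full scale.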

Speaker Diarisation Using 2D Self-attentive Combination of Embeddings

Combination of Deep Speaker Embeddings for Diarisation

Content-Aware Speaker Embeddings for Speaker Diarisation

Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization

Spectral Clustering-Aware Learning of Embeddings for Speaker Diarisation

Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario

Improved Large-Margin Softmax Loss for Speaker Diarisation

Multi-View Speaker Embedding Learning for Enhanced Stability and Discriminability

Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization

Geodesic interpolation of frame-wise speaker embeddings for the diarization of meeting scenarios

Deep Self-Supervised Hierarchical Clustering for Speaker Diarization

DyViSE: Dynamic Vision-Guided Speaker Embedding for Audio-Visual Speaker Diarization

Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization

Y-Vector: Multiscale Waveform Encoder for Speaker Embedding

Speaker Diarization with Lexical Information

Improved Audio Embeddings by Adjacency-Based Clustering with Applications in Spoken Term Detection

Target-speaker Voice Activity Detection with Improved I-Vector Estimation for Unknown Number of Speaker

Multi-scale speaker embedding-based graph attention networks for speaker diarisation

A Spatial Long-Term Iterative Mask Estimation Approach for Multi-Channel Speaker Diarization and Speech Recognition

Leveraging Speaker Embeddings in End-to-End Neural Diarization for Two-Speaker Scenarios

Speaker-turn aware diarization for speech-based cognitive assessments