Abstract: Significant progress has recently been made in speaker diarisation after the introduction of d-vectors as speaker embeddings extracted from neural network (NN) speaker classifiers for clustering speech segments. To extract better-performing and more robust speaker embeddings, this paper proposes a c-vector method that combines multiple sets of complementary d-vectors derived from systems with different NN components. Three structures are used to implement the c-vectors, namely 2D self-attentive, gated additive, and bilinear pooling structures, which rely on attention mechanisms, a gating mechanism, and a low-rank bilinear pooling mechanism respectively. Furthermore, a neural-based single-pass speaker diarisation pipeline is proposed, which uses NNs for voice activity detection, speaker change point detection, and speaker embedding extraction. Experiments and detailed analyses are conducted on the challenging AMI and NIST RT05 datasets, which consist of real meetings with 4–10 speakers and a wide range of acoustic conditions. For systems trained on the AMI training set, relative speaker error rate (SER) reductions of 13% and 29% are obtained by using c-vectors instead of d-vectors on the AMI dev and eval sets respectively, and a relative SER reduction of 15% is observed on RT05, which shows the robustness of the proposed methods. When VoxCeleb data is incorporated into the training set, the best c-vector system achieves relative SER reductions of 7%, 17% and 16% compared to the d-vector system on the AMI dev, eval and RT05 sets respectively.
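To make the idea of combining several d-vectors into one c-vector more concrete, the following is a minimal sketch of a gated additive combination. It is an illustration under stated assumptions, not the structure used in the paper: the abstract only says that a gating mechanism is used, so the scalar sigmoid gate per d-vector, the function name `gated_additive_combine`, and the parameters `gate_weights` and `gate_biases` are all hypothetical choices made for this example.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_additive_combine(d_vectors, gate_weights, gate_biases):
    """Combine K d-vectors of equal length D into one vector of length D.

    Assumption (not taken from the paper): each d-vector d_k receives a
    scalar gate g_k = sigmoid(w_k . d_k + b_k), and the combined vector
    is the gated sum  sum_k g_k * d_k.
    """
    dim = len(d_vectors[0])
    if any(len(d) != dim for d in d_vectors):
        raise ValueError("all d-vectors must have the same dimension")

    combined = [0.0] * dim
    for d, w, b in zip(d_vectors, gate_weights, gate_biases):
        # Scalar gate computed from the d-vector itself.
        gate = sigmoid(sum(wi * di for wi, di in zip(w, d)) + b)
        # Add the gated d-vector to the running sum.
        for i in range(dim):
            combined[i] += gate * d[i]
    return combined


# Toy usage: two 2-dimensional d-vectors with all-zero gate parameters,
# so both gates equal sigmoid(0) = 0.5.
toy_d_vectors = [[1.0, 2.0], [3.0, 4.0]]
toy_weights = [[0.0, 0.0], [0.0, 0.0]]
toy_biases = [0.0, 0.0]
c_vector = gated_additive_combine(toy_d_vectors, toy_weights, toy_biases)
```

In a trained system the gate parameters would be learned jointly with the embedding extractors, so that the more reliable d-vector for a given segment contributes more to the combined embedding; here they are fixed toy values purely to show the data flow.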

Speaker Embedding Extraction with Multi-feature Integration Structure

Y-Vector: Multiscale Waveform Encoder for Speaker Embedding

Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification

Speaker Embedding Extraction with Phonetic Information

Multi-View Speaker Embedding Learning for Enhanced Stability and Discriminability

Introducing Phonetic Information to Speaker Embedding for Speaker Verification

Multi-Task Learning with High-Order Statistics for X-vector Based Text-Independent Speaker Verification

Xi-Vector Embedding for Speaker Recognition

Wav2sv: End-to-end Speaker Embeddings Learning from Raw Waveforms Based on Metric Learning for Speaker Verification

End-to-End Feature Learning for Text-Independent Speaker Verification

Combination of Deep Speaker Embeddings for Diarisation

Joint Speaker Features Learning for Audio-visual Multichannel Speech Separation and Recognition

P-vectors: A Parallel-Coupled TDNN/Transformer Network for Speaker Verification

Hierarchically Attending Time-Frequency and Channel Features for Improving Speaker Verification

A Feature Integration Network for Multi-Channel Speech Enhancement

A Novel I-Vector Framework Using Multiple Features and PCA for Speaker Recognition in Short Speech Condition

An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales

Introducing Multilingual Phonetic Information to Speaker Embedding for Speaker Verification

Fusion of deep shallow features and models for speaker recognition

Neural Scoring, Not Embedding: A Novel Framework for Robust Speaker Verification

Distilling Multi-Level X-vector Knowledge for Small-footprint Speaker Verification