Abstract: Significant progress has recently been made in speaker diarisation after the introduction of d-vectors as speaker embeddings extracted from neural network (NN) speaker classifiers for clustering speech segments. To extract better-performing and more robust speaker embeddings, this paper proposes a c-vector method that combines multiple sets of complementary d-vectors derived from systems with different NN components. Three structures are used to implement the c-vectors, namely 2D self-attentive, gated additive, and bilinear pooling structures, which rely on attention mechanisms, a gating mechanism, and a low-rank bilinear pooling mechanism respectively. Furthermore, a neural-based single-pass speaker diarisation pipeline is also proposed, which uses NNs for voice activity detection, speaker change point detection, and speaker embedding extraction. Experiments and detailed analyses are conducted on the challenging AMI and NIST RT05 datasets, which consist of real meetings with 4–10 speakers and a wide range of acoustic conditions. For systems trained on the AMI training set, relative speaker error rate (SER) reductions of 13% and 29% are obtained by using c-vectors instead of d-vectors on the AMI dev and eval sets respectively, and a relative SER reduction of 15% is observed on RT05, which shows the robustness of the proposed methods. By incorporating VoxCeleb data into the training set, the best c-vector system achieves 7%, 17% and 16% relative SER reductions compared to the d-vector system on the AMI dev, eval and RT05 sets respectively.
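
The relative SER reductions quoted in the abstract compare a c-vector system against a d-vector baseline. A minimal sketch of that arithmetic follows, assuming the usual definition of relative reduction, (baseline - new) / baseline. The function name and the SER values are hypothetical placeholders chosen for illustration; they are not figures from the paper.

```python
def relative_reduction(baseline_ser: float, new_ser: float) -> float:
    """Return the relative reduction of new_ser with respect to baseline_ser.

    The result is a fraction, e.g. 0.15 means a 15% relative reduction.
    """
    if baseline_ser <= 0:
        raise ValueError("baseline_ser must be positive")
    return (baseline_ser - new_ser) / baseline_ser


# Hypothetical absolute SERs in percent (NOT taken from the paper).
d_vector_ser = 20.0  # assumed baseline d-vector system
c_vector_ser = 17.0  # assumed c-vector system

reduction = relative_reduction(d_vector_ser, c_vector_ser)
print(f"Relative SER reduction: {reduction:.0%}")
```

A relative reduction depends on the baseline: the same absolute drop in SER gives a larger relative figure when the baseline SER is lower, which is worth keeping in mind when comparing percentages across the AMI dev, AMI eval and RT05 sets.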

DGC-vector: A new speaker embedding for zero-shot voice conversion

Zero-shot voice conversion based on feature disentanglement

Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning

One-Shot Voice Conversion with Global Speaker Embeddings

SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System for Both Human Beings and Machines

Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech

SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention

Vec-Tok-VC+: Residual-enhanced Robust Zero-shot Voice Conversion with Progressive Constraints in a Dual-mode Training Strategy

Improved deep speaker feature learning for text-dependent speaker recognition

GAZEV: GAN-Based Zero-Shot Voice Conversion over Non-parallel Speech Corpus

Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning

Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment

Combination of Deep Speaker Embeddings for Diarisation

Deep Speaker Vectors for Semi Text-independent Speaker Verification

Residual Speaker Representation for One-Shot Voice Conversion

AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion

Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

CoDiff-VC: A Codec-Assisted Diffusion Model for Zero-shot Voice Conversion

DeltaVLAD: an Efficient Optimization Algorithm to Discriminate Speaker Embedding for Text-Independent Speaker Verification

ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

One-Shot Voice Conversion by Vector Quantization