Abstract: Adapting speaker recognition systems to new environments is a widely used technique for improving a well-performing model, learned from large-scale data, on task-specific small-scale data scenarios. However, previous studies focus on single-domain adaptation, which neglects a more practical scenario in which training data are collected from multiple acoustic domains, as is needed in forensic scenarios. Audio analysis for forensic speaker recognition poses unique challenges for model training with multi-domain data, due to location/scenario uncertainty and diversity mismatch between reference and naturalistic field recordings. It is also difficult to directly employ small-scale domain-specific data to train complex neural network architectures, due to domain mismatch and performance loss. Fine-tuning is a commonly used adaptation method that retrains the model with weights initialized from a well-trained model. Alternatively, in this study, three novel adaptation methods based on domain adversarial training, discrepancy minimization, and moment-matching approaches are proposed to further promote adaptation performance across multiple acoustic domains. A comprehensive set of experiments is conducted to demonstrate that: 1) diverse acoustic environments do impact speaker recognition performance, which could advance research in audio forensics, 2) domain adversarial training learns discriminative features that are also invariant to shifts between domains, 3) discrepancy-minimizing adaptation achieves effective performance simultaneously across multiple acoustic domains, and 4) moment-matching adaptation along with dynamic distribution alignment also significantly promotes speaker recognition performance on each domain, especially for the noisy LENA-field domain, compared to all other systems.
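
The abstract names moment matching as one of the adaptation strategies but does not spell out the loss. As a rough illustration only, the following Python sketch shows one generic way a moment-matching penalty across several acoustic domains could be computed. It is not the paper's formulation: the function names `moment_distance` and `multi_domain_moment_loss`, the use of raw moments up to order 2, and the synthetic "clean"/"noisy" embedding batches are all assumptions made for this example.

```python
import numpy as np

def moment_distance(x, y, order=2):
    """Sum over k = 1..order of the Euclidean distance between the
    k-th raw moments (per feature dimension) of two embedding batches."""
    total = 0.0
    for k in range(1, order + 1):
        mx = np.mean(x ** k, axis=0)  # k-th raw moment of batch x
        my = np.mean(y ** k, axis=0)  # k-th raw moment of batch y
        total += float(np.linalg.norm(mx - my))
    return total

def multi_domain_moment_loss(domains, order=2):
    """Average pairwise moment distance over a list of (n_i, d) batches,
    one batch per acoustic domain (hypothetical helper)."""
    pairs = [(i, j) for i in range(len(domains))
             for j in range(i + 1, len(domains))]
    return sum(moment_distance(domains[i], domains[j], order)
               for i, j in pairs) / len(pairs)

# Synthetic stand-ins for speaker embeddings from different domains.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(256, 8))  # assumed reference domain
noisy = rng.normal(0.5, 1.5, size=(256, 8))  # assumed shifted field domain
same = rng.normal(0.0, 1.0, size=(256, 8))   # second draw from the reference domain

print(moment_distance(clean, noisy))  # larger: the distributions differ
print(moment_distance(clean, same))   # smaller: the distributions match
print(multi_domain_moment_loss([clean, noisy, same]))
```

In a training setup, a penalty of this kind would typically be added to the speaker classification loss so that embeddings from different domains are pushed toward similar statistics; how the study weights or dynamically aligns these terms is not stated in the abstract.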

Adversarial Domain Adaptation for Speaker Verification Using Partially Shared Network

Cross-lingual Text-independent Speaker Verification using Unsupervised Adversarial Discriminative Domain Adaptation

Robust Cross-Domain Speaker Verification with Multi-Level Domain Adapters

Unsupervised Domain Adaptation via Domain Adversarial Training for Speaker Recognition

DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning

Multi-source Domain Adaptation for Text-independent Forensic Speaker Recognition

Channel Adversarial Training for Cross-channel Text-independent Speaker Recognition

Class-Aware Distribution Alignment Based Unsupervised Domain Adaptation for Speaker Verification

Adapting End-to-End Neural Speaker Verification to New Languages and Recording Conditions with Adversarial Training

Adversarial Training Based on Meta-Learning in Unseen Domains for Speaker Verification

Unsupervised Adaptation with Domain Separation Networks for Robust Speech Recognition

Channel Invariant Speaker Embedding Learning with Joint Multi-Task and Adversarial Training

Self-Supervised Learning Based Domain Adaptation for Robust Speaker Verification

Multi-Domain Adaptation by Self-Supervised Learning for Speaker Verification

VAE-based Domain Adaptation for Speaker Verification

EDITnet: A Lightweight Network for Unsupervised Domain Adaptation in Speaker Verification

Adversarial Speaker Verification

Distance Metric-Based Open-Set Domain Adaptation for Speaker Verification

Contrastive Learning and Inter-Speaker Distribution Alignment Based Unsupervised Domain Adaptation for Robust Speaker Verification

Adversarial Domain Adaptation with Domain Mixup

CDMA: Cross-Domain Distance Metric Adaptation for Speaker Verification