Adversarial Data Augmentation for Robust Speaker Verification

Zhenyu Zhou, Junhui Chen, Namin Wang, Lantian Li, Dong Wang
2024-02-05
Abstract: Data augmentation (DA) has gained widespread popularity in deep speaker models due to its ease of implementation and significant effectiveness. It enriches training data by simulating real-life acoustic variations, enabling deep neural networks to learn speaker-related representations while disregarding irrelevant acoustic variations, thereby improving robustness and generalization. However, a potential issue with the vanilla DA is augmentation residual, i.e., unwanted distortion caused by different types of augmentation. To address this problem, this paper proposes a novel approach called adversarial data augmentation (A-DA) which combines DA with adversarial learning. Specifically, it involves an additional augmentation classifier to categorize various augmentation types used in data augmentation. This adversarial learning empowers the network to generate speaker embeddings that can deceive the augmentation classifier, making the learned speaker embeddings more robust in the face of augmentation variations. Experiments conducted on VoxCeleb and CN-Celeb datasets demonstrate that our proposed A-DA outperforms standard DA in both augmentation matched and mismatched test conditions, showcasing its superior robustness and generalization against acoustic variations.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
This paper proposes adversarial data augmentation (A-DA) to address the "augmentation residual" problem in conventional data augmentation (DA) for speaker verification, the task of deciding whether a speech segment belongs to the claimed speaker. DA enriches the training data by simulating real-life acoustic variations, which helps the model ignore speaker-irrelevant acoustic changes. However, the different augmentation types can also introduce unwanted distortion of their own; this is the augmentation residual. To address it, the paper combines DA with adversarial learning: an additional augmentation classifier is trained to identify which augmentation type was applied, while the speaker network is trained to generate embeddings that deceive this classifier. The resulting embeddings are more robust to augmentation variations. Experiments on the VoxCeleb and CN-Celeb datasets show that A-DA outperforms standard DA in both augmentation-matched and augmentation-mismatched test conditions, indicating better robustness and generalization to acoustic variations. In short, A-DA aims to improve the robustness of deep speaker models under complex acoustic conditions.
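The summary above does not say how the adversarial objective is formulated, so the following is only a minimal sketch under stated assumptions: a min-max setup in which the augmentation classifier minimizes its own cross-entropy while the speaker encoder minimizes the speaker loss minus a weighted augmentation loss. The function names and the weight `lam` are hypothetical and not taken from the paper; in a real training loop this kind of objective is often realized with a gradient reversal layer, which may or may not be what the authors use.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of one example from raw logits and an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def adversarial_objectives(spk_logits, spk_label, aug_logits, aug_label, lam=0.1):
    """Return (encoder_objective, classifier_objective) for one example.

    Assumption (not from the paper): the encoder minimizes
    speaker_loss - lam * augmentation_loss, so it is rewarded when the
    augmentation classifier is confused; the classifier minimizes
    augmentation_loss as usual.
    """
    spk_loss = softmax_cross_entropy(spk_logits, spk_label)
    aug_loss = softmax_cross_entropy(aug_logits, aug_label)
    encoder_objective = spk_loss - lam * aug_loss
    classifier_objective = aug_loss
    return encoder_objective, classifier_objective


# Hypothetical example: 2 speakers, 4 augmentation types.
# Case A: the embedding leaks the augmentation type (classifier is confident).
enc_leaky, cls_leaky = adversarial_objectives([0.0, 0.0], 0, [10.0, 0.0, 0.0, 0.0], 0)
# Case B: the embedding hides the augmentation type (classifier is at chance).
enc_robust, cls_robust = adversarial_objectives([0.0, 0.0], 0, [0.0, 0.0, 0.0, 0.0], 0)
```

With the speaker loss held fixed, the encoder objective is lower in case B than in case A, which is the sense in which the encoder is pushed toward embeddings that "deceive" the augmentation classifier.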