Voice Conversion Augmentation for Speaker Recognition on Defective Datasets

Ruijie Tao, Zhan Shi, Yidi Jiang, Tianchi Liu, Haizhou Li
2024-04-01
Abstract: Modern speaker recognition systems rely on abundant and balanced datasets for classification training. However, diverse defective datasets, such as partially-labelled, small-scale, and imbalanced datasets, are common in real-world applications. Previous works usually studied specific solutions for each scenario from the algorithm perspective. However, the root cause of these problems lies in dataset imperfections. To address these challenges with a unified solution, we propose the Voice Conversion Augmentation (VCA) strategy to obtain pseudo speech from the training set. Furthermore, to guarantee generation quality, we designed the VCA-NN (nearest neighbours) strategy to select source speech from utterances that are close to the target speech in the representation space. Our experimental results on three created datasets demonstrated that VCA-NN effectively mitigates these dataset problems, which provides a new direction for handling speaker recognition problems from the data aspect.
Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses how speaker recognition systems can be trained on defective datasets, such as partially labeled, small-scale, and imbalanced datasets. The authors propose a strategy called Voice Conversion Augmentation (VCA), which uses voice conversion to generate pseudo speech from the training set and so augment the training data. To safeguard generation quality, the VCA-NN (nearest neighbours) strategy selects source speech from utterances that are close to the target speech in the representation space. Experimental results on three constructed datasets show that VCA-NN mitigates these dataset problems and points to a new direction for handling speaker recognition problems from the data side.

1. Problems addressed in the paper:
- How to make effective use of unlabeled utterances in partially labeled datasets for semi-supervised speaker recognition.
- How to train models that generalize well from small-scale datasets.
- How to handle imbalanced speaker distributions so that models are not biased towards speakers who contribute more samples.

2. Solution:
- The VCA strategy generates pseudo speech through voice conversion, enriching the speakers' representation space and augmenting the defective dataset.
- The VCA-NN strategy selects source speech whose representations are close to the target speech, improving the quality of the generated speech.

3. Experimental results:
- In three typical scenarios (semi-supervised, small-scale, and imbalanced learning), the VCA-NN method outperforms other scenario-specific solutions and improves speaker recognition performance.

4. Further research directions:
- The method may be sensitive to low-quality pseudo speech, so robustness to poor generations needs improvement.
- The speed of applying VCA to large-scale datasets needs to be addressed.
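
The nearest-neighbour selection step in VCA-NN can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_nn_sources`, the use of cosine similarity as the closeness measure, the value of `k`, and the assumption that each utterance already has a fixed-length, non-zero speaker embedding are all hypothetical choices made for illustration.

```python
import numpy as np


def select_nn_sources(target_emb, candidate_embs, k=3):
    """Pick the k candidate utterances closest to the target embedding.

    Closeness is measured with cosine similarity (an assumption; the
    summary above only says "close in the representation space").
    Returns (indices, similarities), ordered from most to least similar.
    """
    target = np.asarray(target_emb, dtype=float)
    cands = np.asarray(candidate_embs, dtype=float)

    # L2-normalise so that a dot product equals cosine similarity.
    # Assumes no embedding is the zero vector.
    target = target / np.linalg.norm(target)
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)

    sims = cands @ target

    # Never ask for more neighbours than there are candidates.
    k = min(k, len(cands))

    # Stable sort on negated similarity: highest similarity first,
    # ties keep their original order.
    order = np.argsort(-sims, kind="stable")[:k]
    return order.tolist(), sims[order].tolist()


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real speaker embeddings.
    target = [1.0, 0.0]
    candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    indices, similarities = select_nn_sources(target, candidates, k=2)
    print(indices, similarities)
```

In a full pipeline, the selected utterances would then serve as source speech for a voice conversion model that renders them in the target speaker's voice; that conversion step depends on the specific model used and is not shown here.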