Abstract: Modern speaker recognition systems rely on abundant and balanced datasets for classification training. However, diverse defective datasets, such as partially-labelled, small-scale, and imbalanced datasets, are common in real-world applications. Previous works usually studied specific solutions for each scenario from the algorithm perspective. However, the root cause of these problems lies in dataset imperfections. To address these challenges with a unified solution, we propose the Voice Conversion Augmentation (VCA) strategy to obtain pseudo speech from the training set. Furthermore, to guarantee generation quality, we designed the VCA-NN (nearest neighbours) strategy to select source speech from utterances that are close to the target speech in the representation space. Our experimental results on three created datasets demonstrated that VCA-NN effectively mitigates these dataset problems, which provides a new direction for handling speaker recognition problems from the data aspect.

What problem does this paper attempt to address?

This paper addresses the challenges that speaker recognition systems face when trained on defective datasets, such as partially labeled, small-scale, and imbalanced datasets. The authors propose a strategy called Voice Conversion Augmentation (VCA), which uses voice conversion to generate pseudo speech from the training set and thereby augment the training data. To ensure the quality of the generated speech, the VCA-NN (nearest neighbours) strategy selects source speech from utterances that are close to the target speech in the representation space. Experimental results on three constructed datasets show that VCA-NN effectively mitigates these dataset problems and offers a new direction for handling speaker recognition problems from the data perspective.

1. Problems addressed in the paper:
- How to use the unlabeled utterances in a partially labeled dataset effectively for semi-supervised speaker recognition.
- How to train models that generalize well from small-scale datasets.
- How to handle imbalanced speaker distributions so that models are not biased towards speakers who contribute more samples.

2. Solution:
- The VCA strategy generates pseudo speech using voice conversion, enriching the representation space of speakers and augmenting defective datasets.
- The VCA-NN strategy selects source speech that is close to the target speech in the representation space, improving the quality of the generated speech.

3. Experimental results:
- In three typical scenarios (semi-supervised, small-scale, and imbalanced learning), the VCA-NN method outperforms scenario-specific solutions, improving the performance of speaker recognition systems.

4. Further research directions:
- The algorithm may be sensitive to low-quality pseudo speech and requires optimization.
- The speed of applying VCA to large-scale datasets needs to be addressed.
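The nearest-neighbour selection behind VCA-NN can be illustrated with a small sketch. This is not the authors' implementation: the text does not say which distance measure or embedding model is used, so cosine similarity, the toy two-dimensional embeddings, and the names `cosine_similarity` and `select_nearest_sources` are all assumptions made for illustration. The voice conversion step itself is omitted.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_nearest_sources(target_embedding, candidates, k=2):
    """Return the ids of the k candidate utterances closest to the target.

    `candidates` maps a hypothetical utterance id to its speaker embedding.
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine_similarity(target_embedding, item[1]),
        reverse=True,  # most similar first
    )
    return [utt_id for utt_id, _ in ranked[:k]]


# Toy embeddings; real ones would come from a speaker encoder.
target = [1.0, 0.0]
candidates = {
    "utt_a": [0.9, 0.1],
    "utt_b": [0.0, 1.0],
    "utt_c": [0.7, 0.7],
    "utt_d": [-1.0, 0.0],
}

nearest = select_nearest_sources(target, candidates, k=2)
print(nearest)
```

In this sketch, only the utterances returned by `select_nearest_sources` would be passed on as source speech for conversion towards the target speaker; the more distant candidates are skipped.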

Voice Conversion Augmentation for Speaker Recognition on Defective Datasets

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

AudioVSR: Enhancing Video Speech Recognition with Audio Data

Wavoice: an Mmwave-Assisted Noise-Resistant Speech Recognition System

Improving Recognition-Synthesis Based Any-to-one Voice Conversion with Cyclic Training

Data Augmentation for Diverse Voice Conversion in Noisy Environments

Exploring Voice Conversion based Data Augmentation in Text-Dependent Speaker Verification

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Residual Speaker Representation for One-Shot Voice Conversion

Voice Conversion towards Arbitrary Speakers With Limited Data.

Non-Parallel Voice Conversion with Autoregressive Conversion Model and Duration Adjustment

Who is Authentic Speaker

Improving the Efficiency of Dysarthria Voice Conversion System Based on Data Augmentation

Two-stage and Self-supervised Voice Conversion for Zero-Shot Dysarthric Speech Reconstruction

WaveNet Vocoder with Limited Training Data for Voice Conversion

Building Bilingual and Code-Switched Voice Conversion with Limited Training Data Using Embedding Consistency Loss

NeuralVC: Any-to-Any Voice Conversion Using Neural Networks Decoder for Real-Time Voice Conversion

How far are we from robust voice conversion: a survey

Voice Conversion for Stuttered Speech, Instruments, Unseen Languages and Textually Described Voices