Abstract: Automatic Speaker Verification (ASV) technology has become commonplace in virtual assistants. However, its performance suffers when there is a mismatch between the training and test domains. Mixed-bandwidth training, i.e., pooling training data from both domains, is a preferred choice for developing a universal model that works for both narrowband and wideband domains. We propose complementing this technique with neural upsampling of narrowband signals, also known as bandwidth extension. We aim to discover and analyze high-performing time-domain Generative Adversarial Network (GAN) based models to improve our downstream state-of-the-art ASV system. We choose GANs because they 1) are powerful for learning conditional distributions and 2) allow flexible plug-in usage as a pre-processor during the training of downstream tasks (ASV) with data augmentation. Prior work mainly focused on feature-domain bandwidth extension and used limited experimental setups. We address these limitations by 1) using time-domain extension models, 2) reporting results on three real test sets, 3) extending the training data, and 4) devising new test-time schemes. We compare supervised (conditional GAN) and unsupervised (CycleGAN) models and demonstrate an average relative improvement in the equal error rate of 8.6% and 7.7%, respectively. For further analysis, we study changes in the visual quality of the spectrogram, audio perceptual quality, t-SNE embeddings, and ASV score distributions. We show that our bandwidth extension leads to phenomena such as a shift of telephone (test) embeddings towards wideband (train) signals, a negative correlation of perceptual quality with downstream performance, and condition-independent score calibration.
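
The abstract reports gains as a relative improvement in equal error rate (EER). The following is a minimal sketch of how such a figure could be computed, assuming a simple threshold sweep over trial scores; the function names and the toy numbers are hypothetical and are not taken from the paper, whose exact scoring tooling is not described here.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping the threshold over all observed scores.

    Assumption: a higher score means "same speaker". Returns a fraction in [0, 1].
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, 0.5 * (far + frr)
    return eer


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent; positive means the new system is better."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


if __name__ == "__main__":
    # Toy, made-up scores purely for illustration.
    targets = [2.0, 3.0]
    nontargets = [0.0, 1.0]
    print(equal_error_rate(targets, nontargets))
    # Illustrative EER values (in percent), not results from the paper.
    print(relative_improvement(5.0, 4.5))
```

Under this definition, an average relative improvement of 8.6% means the EER after bandwidth extension is, on average, 8.6% lower than the baseline EER, not 8.6 percentage points lower.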

A Cycle-GAN Approach to Model Natural Perturbations in Speech for ASR Applications

Joint Magnitude Estimation and Phase Recovery Using Cycle-In-Cycle GAN for Non-Parallel Speech Enhancement

Exploring Speech Enhancement with Generative Adversarial Networks for Robust Speech Recognition

Emotional Speech Generator by using Generative Adversarial Networks

CycleGAN-based Non-parallel Speech Enhancement with an Adaptive Attention-in-attention Mechanism

Generative adversarial network-based glottal waveform model for statistical parametric speech synthesis

Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding

MaskCycleGAN-based Whisper to Normal Speech Conversion

Study of GANs for Noisy Speech Simulation from Clean Speech

WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks

Time-domain Speech Super-resolution with GAN based Modeling for Telephony Speaker Verification

Emotional Voice Conversion With Cycle-consistent Adversarial Network

Generative Adversarial Networks for Unpaired Voice Transformation on Impaired Speech

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks

Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition

Personalized Adversarial Data Augmentation for Dysarthric and Elderly Speech Recognition

Enhancing Pre-trained ASR System Fine-tuning for Dysarthric Speech Recognition using Adversarial Data Augmentation

Adversarial Data Augmentation Using VAE-GAN for Disordered Speech Recognition

CycleGAN-VC-GP: Improved CycleGAN-based Non-parallel Voice Conversion

The Effectiveness of Time Stretching for Enhancing Dysarthric Speech for Improved Dysarthric Speech Recognition