Abstract: Automatic Speaker Verification (ASV) technology has become commonplace in virtual assistants. However, its performance suffers when there is a mismatch between the training and test domains. Mixed-bandwidth training, i.e., pooling training data from both domains, is a preferred choice for developing a universal model that works for both narrowband and wideband domains. We propose complementing this technique with neural upsampling of narrowband signals, also known as bandwidth extension. We aim to discover and analyze high-performing time-domain Generative Adversarial Network (GAN) based models to improve our downstream state-of-the-art ASV system. We choose GANs because they 1) are powerful for learning conditional distributions and 2) allow flexible plug-in usage as a pre-processor during the training of downstream tasks (ASV) with data augmentation. Prior work mainly focused on feature-domain bandwidth extension and used limited experimental setups. We address these limitations by 1) using time-domain extension models, 2) reporting results on three real test sets, 3) extending the training data, and 4) devising new test-time schemes. We compare supervised (conditional GAN) and unsupervised (CycleGAN) models and demonstrate average relative improvements in equal error rate of 8.6% and 7.7%, respectively. For further analysis, we study changes in the visual quality of the spectrogram, audio perceptual quality, t-SNE embeddings, and ASV score distributions. We show that our bandwidth extension leads to phenomena such as a shift of telephone (test) embeddings towards wideband (train) signals, a negative correlation of perceptual quality with downstream performance, and condition-independent score calibration.
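The headline results are stated as relative improvements in equal error rate (EER). As a minimal sketch of how such a figure can be computed from verification trial scores, the Python below sweeps a decision threshold over the pooled scores and takes the point where the false-acceptance and false-rejection rates are closest. The function names `equal_error_rate` and `relative_improvement` are hypothetical and chosen for illustration; this is not the authors' evaluation code, and real toolkits typically interpolate the ROC curve rather than picking the nearest threshold.

```python
import numpy as np


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER (as a fraction in [0, 1]) by a threshold sweep.

    Assumption: higher scores mean "more likely the same speaker".
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejection rate: target trials scoring below the threshold.
    frr = np.array([(target < t).mean() for t in thresholds])
    # False acceptance rate: non-target trials at or above the threshold.
    far = np.array([(nontarget >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer
```

With this definition, a system that lowers a baseline EER of 10.0 to 9.0 shows a relative improvement of 10 percent, even though the absolute change is only one point.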

Transformation of low-quality device-recorded speech to high-quality speech using improved SEGAN model

SE-MelGAN -- Speaker Agnostic Rapid Speech Enhancement

SEGAN: Speech Enhancement Generative Adversarial Network

SEFGAN: Harvesting the Power of Normalizing Flows and GANs for Efficient High-Quality Speech Enhancement

Noise Prior Knowledge Learning for Speech Enhancement Via Gated Convolutional Generative Adversarial Network

Conditional Generative Adversarial Networks for Speech Enhancement and Noise-Robust Speaker Verification

VSEGAN: Visual Speech Enhancement Generative Adversarial Network

iSEGAN: Improved Speech Enhancement Generative Adversarial Networks

Using Speech Enhancement to Realize Speech Synthesis of Low-Resource Dungan Languages

Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model

Robust Real-time Audio-Visual Speech Enhancement based on DNN and GAN

CMGAN: Conformer-Based Metric-GAN for Monaural Speech Enhancement

Study of GANs for Noisy Speech Simulation from Clean Speech

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks

Guided Speech Enhancement Network

Time-domain Speech Enhancement with Generative Adversarial Learning

High Fidelity Speech Enhancement with Band-split RNN

Time-domain Speech Super-resolution with GAN based Modeling for Telephony Speaker Verification

CMGAN: Conformer-based Metric GAN for Speech Enhancement

Single-Channel Speech Quality Enhancement in Mobile Networks Based on Generative Adversarial Networks