Abstract: It is challenging to accelerate the training process while ensuring both high-quality generated voices and acceptable inference speed. In this paper, we propose a novel neural vocoder called InstructSing, which converges much faster than other neural vocoders while maintaining good performance by integrating differentiable digital signal processing and adversarial training. It includes one generator and two discriminators. Specifically, the generator incorporates a harmonic-plus-noise (HN) module to produce 8 kHz audio as an instructive signal. The HN module is then connected to an extended WaveNet by a UNet-based module, which transforms the output of the HN module into a latent variable sequence containing essential periodic and aperiodic information. In addition to the latent sequence, the extended WaveNet also takes the mel-spectrogram as input to generate 48 kHz high-fidelity singing voices. For the discriminators, we combine a multi-period discriminator, as originally proposed in HiFiGAN, with a multi-resolution multi-band STFT discriminator. Notably, InstructSing achieves voice quality comparable to other neural vocoders with only one-tenth of the training steps on a machine with four NVIDIA V100 GPUs (demo page: https://wavelandspeech.github.io/instructsing/). We plan to open-source our code and pretrained model once the paper is accepted.
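To make the idea of a harmonic-plus-noise signal concrete, the following is a minimal, non-differentiable sketch in plain Python. It is not the InstructSing HN module: the function name, the parameters, and the fixed gains are all assumptions made for illustration, whereas the abstract describes a differentiable DSP component trained jointly with the rest of the generator. The sketch only shows the basic decomposition: a sum of sinusoids at integer multiples of the fundamental frequency (the periodic part) plus Gaussian noise (the aperiodic part), rendered at 8 kHz.

```python
import math
import random


def harmonic_plus_noise(f0_frames, sample_rate=8000, hop=80, n_harmonics=8,
                        harmonic_gain=0.5, noise_gain=0.05, seed=0):
    """Render a waveform from a frame-level F0 contour in Hz (0 = unvoiced).

    Every name and default here is a hypothetical choice for illustration,
    not a value taken from the paper.
    """
    rng = random.Random(seed)          # fixed seed keeps the noise reproducible
    nyquist = sample_rate / 2.0
    phase = 0.0                        # running phase of the fundamental
    samples = []
    for f0 in f0_frames:
        # Keep only harmonics strictly below Nyquist to avoid aliasing.
        harmonics = [k for k in range(1, n_harmonics + 1) if k * f0 < nyquist]
        for _ in range(hop):
            # Aperiodic component: white Gaussian noise, present in all frames.
            value = noise_gain * rng.gauss(0.0, 1.0)
            if f0 > 0 and harmonics:
                # Periodic component: accumulate phase so F0 changes stay smooth.
                phase = (phase + 2.0 * math.pi * f0 / sample_rate) % (2.0 * math.pi)
                periodic = sum(math.sin(k * phase) for k in harmonics)
                value += harmonic_gain * periodic / len(harmonics)
            samples.append(value)
    return samples


# One unvoiced frame followed by two voiced frames at 220 Hz (10 ms hop at 8 kHz).
wave = harmonic_plus_noise([0.0, 220.0, 220.0])
print(len(wave))
```

In a trainable setting the harmonic amplitudes and the noise filter would typically be predicted by a network instead of being fixed constants as they are here.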

DeepGAN: A Fast and High-Quality Time-Domain-based Neural Vocoder for Low-Resource Scenarios

HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder

WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis

A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate

MusicHiFi: Fast High-Fidelity Stereo Vocoding

SingGAN: Generative Adversarial Network for High-Fidelity Singing Voice Generation

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter

HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

A Fast High-Fidelity Source-Filter Vocoder with Lightweight Neural Modules

InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself

Fast Neural Speech Waveform Generative Models With Fully-Connected Layer-Based Upsampling

Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis

An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoder

Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders

Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder

Voice Conversion with Denoising Diffusion Probabilistic GAN Models

Avocodo: Generative Adversarial Network for Artifact-free Vocoder