Abstract: High-fidelity singing voices usually require a higher sampling rate (e.g., 48kHz) to convey expression and emotion. However, a higher sampling rate results in a wider frequency band and longer waveform sequences, which poses challenges for singing voice synthesis (SVS) in both the frequency and time domains. Conventional SVS systems that adopt a low sampling rate cannot adequately address these challenges. In this paper, we develop HiFiSinger, an SVS system towards high-fidelity singing voice. HiFiSinger consists of a FastSpeech-based acoustic model and a Parallel WaveGAN-based vocoder to ensure fast training and inference as well as high voice quality. To tackle the difficulty of singing modeling caused by the high sampling rate (wider frequency band and longer waveform), we introduce multi-scale adversarial training in both the acoustic model and the vocoder. Specifically: 1) To handle the larger range of frequencies caused by the higher sampling rate, we propose a novel sub-frequency GAN (SF-GAN) for mel-spectrogram generation, which splits the full 80-dimensional mel-frequency range into multiple sub-bands and models each sub-band with a separate discriminator. 2) To model the longer waveform sequences caused by the higher sampling rate, we propose a multi-length GAN (ML-GAN) for waveform generation, which models waveform sequences of different lengths with separate discriminators. 3) We also introduce several additional designs and findings in HiFiSinger that are crucial for high-fidelity voices, such as adding F0 (pitch) and V/UV (voiced/unvoiced flag) as acoustic features, choosing an appropriate window/hop size for the mel-spectrogram, and increasing the receptive field in the vocoder for long vowel modeling. Experimental results show that HiFiSinger synthesizes high-fidelity singing voices with much higher quality: MOS gains of 0.32 and 0.44 over the 48kHz and 24kHz baselines, respectively, and a 0.83 MOS gain over previous SVS systems.
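
As a rough illustration of the input preparation behind SF-GAN and ML-GAN described above, the following minimal sketch shows how an 80-dimensional mel-spectrogram might be split into sub-bands and how a waveform might be cropped to several lengths, with each slice or crop intended for its own discriminator. The function names, the band boundaries, and the crop lengths are assumptions made for illustration; the abstract does not specify them, and the discriminators themselves are omitted.

```python
import random


def split_mel_bands(mel, band_ranges):
    """Slice each mel frame into sub-bands.

    mel: list of frames, each a list of mel-bin values (e.g. 80 per frame).
    band_ranges: list of (low, high) bin indices; values here are hypothetical.
    Returns one sub-band spectrogram per range, each meant for a separate
    discriminator.
    """
    return [[frame[lo:hi] for frame in mel] for lo, hi in band_ranges]


def random_crops(waveform, lengths, rng):
    """Take one random contiguous crop of the waveform per requested length.

    Each crop would be scored by the discriminator assigned to that length.
    """
    crops = []
    for n in lengths:
        if n > len(waveform):
            raise ValueError("crop length exceeds waveform length")
        start = rng.randrange(len(waveform) - n + 1)
        crops.append(waveform[start:start + n])
    return crops


# Toy inputs: 5 frames of an 80-bin mel-spectrogram and a 1000-sample waveform.
mel = [[float(i) for i in range(80)] for _ in range(5)]
waveform = list(range(1000))

# Assumed overlapping low/mid/high bands and assumed crop lengths.
bands = split_mel_bands(mel, [(0, 40), (20, 60), (40, 80)])
crops = random_crops(waveform, [100, 400], random.Random(0))
```

In a real system the slices and crops would be tensors fed to neural discriminators whose adversarial losses are combined; this sketch only covers how the inputs could be partitioned.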

MusicHiFi: Fast High-Fidelity Stereo Vocoding

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement

SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis

HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

Basis-MelGAN: Efficient Neural Vocoder Based on Audio Decomposition

Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder

From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

High-Fidelity Audio Compression with Improved RVQGAN

A High Fidelity and Low Complexity Neural Audio Coding

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

DeepGAN: A Fast and High-Quality Time-Domain-based Neural Vocoder for Low-Resource Scenarios

EVA-GAN: Enhanced Various Audio Generation via Scalable Generative Adversarial Networks

HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

Perceiving Music Quality with GANs

Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis

InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself

FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder