HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

Yinghao Aaron Li, Cong Han, Xilin Jiang, Nima Mesgarani
2023-09-18
Abstract: Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In this paper, we introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain that uses a sinusoidal source from the fundamental frequency (F0) inferred via a pre-trained F0 estimation network for fast inference speed. Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance. HiFTNet also outperforms BigVGAN-base on LibriTTS for unseen speakers and achieves comparable performance to BigVGAN while being four times faster with only $1/6$ of the parameters. Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications that demand high-quality speech synthesis.
Audio and Speech Processing, Artificial Intelligence, Sound
What problem does this paper attempt to address?
This paper addresses the problem of efficient, high-quality waveform generation in speech synthesis. Existing GAN-based vocoders such as HiFi-GAN and BigVGAN generate high-fidelity waveforms, but they are computationally expensive and parameter-heavy, which makes them a poor fit for real-time applications. To address this, the paper proposes HiFTNet, a neural vocoder that extends iSTFTNet. It adds a harmonic-plus-noise source filter in the time-frequency domain, driven by a sinusoidal source built from the fundamental frequency (F0), and generates the waveform through the inverse short-time Fourier transform (iSTFT). The result is fast, parameter-efficient, high-quality waveform generation.

### Main Contributions:

1. **Efficiency**: HiFTNet improves inference speed and reduces the number of parameters relative to BigVGAN while maintaining high quality.
2. **High Quality**: Subjective evaluations show that HiFTNet is rated on par with ground-truth recordings on the LJSpeech dataset and achieves performance comparable to BigVGAN on the LibriTTS dataset.
3. **Techniques**: A harmonic-plus-noise source filter (hn-NSF) operating in the time-frequency domain, together with components such as the Snake activation function, improves the model's performance.

### Experimental Results:

- On the LJSpeech dataset, HiFTNet achieved a CMOS score of -0.06 (p≫0.05) against ground-truth recordings, meaning no statistically significant difference from real audio.
- On the LibriTTS dataset, HiFTNet achieved a CMOS score of 0.21 (p<0.05) on the test-clean subset, significantly outperforming BigVGAN-base, with an inference speed 2.5 times faster and the same GPU memory usage.
- On the test-other subset, HiFTNet achieved a CMOS score of -0.05 (p≫0.05), comparable to BigVGAN, but with an inference speed 4 times faster and only 1/6 of the parameters.
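
The harmonic-plus-noise source mentioned above can be illustrated with a minimal NumPy sketch. It covers only the construction of a sinusoidal excitation from a frame-level F0 contour plus a noise component; it is not the authors' implementation. The function name `harmonic_plus_noise_source` and all parameter values (sample rate, hop size, number of harmonics, amplitudes, noise levels) are illustrative assumptions, not settings taken from the paper. In HiFTNet the source is processed in the time-frequency domain, and that step is not shown here.

```python
import numpy as np

def harmonic_plus_noise_source(f0, sample_rate=22050, hop_size=256,
                               num_harmonics=8, sine_amp=0.1,
                               voiced_noise_std=0.003, seed=0):
    """Build a sample-level excitation from a frame-level F0 contour.

    f0: 1-D sequence of F0 values in Hz, one per frame (0 = unvoiced).
    Returns (harmonic, noise), each of length len(f0) * hop_size.
    All defaults are hypothetical values chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    f0 = np.asarray(f0, dtype=np.float64)

    # Upsample F0 to the sample rate by repeating each frame value.
    f0_up = np.repeat(f0, hop_size)
    voiced = f0_up > 0

    # Instantaneous phase of the fundamental: running sum of f0 / sample_rate.
    phase = 2.0 * np.pi * np.cumsum(f0_up / sample_rate)

    harmonic = np.zeros_like(f0_up)
    for k in range(1, num_harmonics + 1):
        # Drop harmonics at or above the Nyquist frequency to avoid aliasing.
        below_nyquist = (k * f0_up) < (sample_rate / 2)
        harmonic += np.sin(k * phase) * below_nyquist

    # Sinusoids are present only in voiced regions.
    harmonic = sine_amp * harmonic * voiced

    # Assumed noise model: weak noise in voiced regions, stronger in unvoiced.
    noise_std = np.where(voiced, voiced_noise_std, sine_amp / 3.0)
    noise = rng.standard_normal(f0_up.shape) * noise_std
    return harmonic, noise
```

For example, `harmonic_plus_noise_source([0, 0, 220, 220])` returns two arrays of 1024 samples each, where the first 512 samples of the harmonic component are zero because those frames are unvoiced.
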
### Conclusion:

By combining a harmonic-plus-noise source filter with iSTFT-based waveform generation, HiFTNet improves inference speed and reduces the number of parameters while maintaining high quality, setting a new benchmark for real-time speech synthesis applications.