Abstract: Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In this paper, we introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain that uses a sinusoidal source from the fundamental frequency (F0) inferred via a pre-trained F0 estimation network for fast inference speed. Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance. HiFTNet also outperforms BigVGAN-base on LibriTTS for unseen speakers and achieves comparable performance to BigVGAN while being four times faster with only $1/6$ of the parameters. Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications that demand high-quality speech synthesis.
What problem does this paper attempt to address?
This paper addresses efficient, high-quality waveform generation in speech synthesis. Existing GAN-based vocoders such as HiFi-GAN and BigVGAN generate high-fidelity waveforms from mel-spectrograms, but they are computationally expensive and parameter-heavy, which makes them a poor fit for real-time applications. The paper proposes HiFTNet, a neural vocoder that extends iSTFTNet. It adds a harmonic-plus-noise source filter in the time-frequency domain, driven by a sinusoidal source built from the fundamental frequency (F0) that a pre-trained F0 estimation network infers, and it produces the waveform through an inverse short-time Fourier transform (iSTFT). The goal is fast, low-parameter, high-quality waveform generation.
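The iSTFT step mentioned above can be illustrated with a minimal sketch. It assumes a periodic Hann window with 50% overlap and uses a naive DFT from the Python standard library. The function names `stft` and `istft` and the toy sizes are hypothetical choices for illustration and are not taken from the paper. In an iSTFTNet-style vocoder the network would predict the magnitude and phase, whereas here they come from a forward transform so that the round trip can be checked.

```python
import cmath
import math


def hann(n_fft):
    # Periodic Hann window; with hop = n_fft // 2 the shifted copies sum to 1.
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]


def stft(signal, n_fft, hop):
    """Return per-frame magnitude and phase lists using a naive O(N^2) DFT."""
    window = hann(n_fft)
    mags, phases = [], []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(n_fft)]
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft))
            for k in range(n_fft)
        ]
        mags.append([abs(c) for c in spectrum])
        phases.append([cmath.phase(c) for c in spectrum])
    return mags, phases


def istft(mags, phases, n_fft, hop):
    """Rebuild a waveform from magnitude and phase by inverse DFT and overlap-add."""
    out = [0.0] * (hop * (len(mags) - 1) + n_fft)
    for i, (mag, phase) in enumerate(zip(mags, phases)):
        # Combine magnitude and phase back into complex spectral bins.
        spectrum = [cmath.rect(m, p) for m, p in zip(mag, phase)]
        for n in range(n_fft):
            sample = sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / n_fft)
                         for k in range(n_fft))
            out[i * hop + n] += sample.real / n_fft
    return out


# Toy round trip: a short sine wave, 8-point frames, hop of 4 samples.
signal = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
mags, phases = stft(signal, n_fft=8, hop=4)
reconstructed = istft(mags, phases, n_fft=8, hop=4)
```

Samples covered by two overlapping frames (everything except the first and last `hop` samples) are recovered up to floating-point error, because the overlapping Hann windows sum to one. A practical implementation would use an FFT-based routine instead of the naive DFT shown here.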
### Main Contributions:
1. **Efficiency**: HiFTNet runs about four times faster than BigVGAN with roughly one sixth of the parameters while maintaining high quality.
2. **High Quality**: Subjective evaluations show that HiFTNet performs comparably to real audio on the LJSpeech dataset and achieves performance comparable to BigVGAN on the LibriTTS dataset.
3. **Innovative Techniques**: The model introduces a harmonic-plus-noise neural source filter (hn-NSF) in the time-frequency domain and combines it with additional techniques such as the Snake activation function to improve performance.
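The two ingredients named in item 3 can be sketched in a few lines. This is a generic illustration and not the paper's implementation. The function name `harmonic_plus_noise_source`, the number of harmonics, and all amplitude and noise values are assumptions chosen for readability. The Snake activation follows its standard definition, $x + \frac{1}{\alpha}\sin^2(\alpha x)$.

```python
import math
import random


def harmonic_plus_noise_source(f0_per_sample, sample_rate, n_harmonics=4,
                               sine_amp=0.1, voiced_noise_std=0.003,
                               unvoiced_noise_std=0.03, seed=0):
    """Sum sinusoids at multiples of F0 when voiced, emit noise when unvoiced.

    All default values are illustrative assumptions, not the paper's settings.
    f0_per_sample holds one F0 value in Hz per output sample (0 means unvoiced).
    """
    rng = random.Random(seed)
    phase = 0.0  # running phase of the fundamental, measured in cycles
    source = []
    for f0 in f0_per_sample:
        if f0 > 0:  # voiced sample
            phase = (phase + f0 / sample_rate) % 1.0
            harmonics = sum(
                math.sin(2 * math.pi * h * phase)
                for h in range(1, n_harmonics + 1)
                if h * f0 < sample_rate / 2  # skip harmonics above Nyquist
            )
            source.append(sine_amp * harmonics + rng.gauss(0.0, voiced_noise_std))
        else:  # unvoiced sample: noise only
            source.append(rng.gauss(0.0, unvoiced_noise_std))
    return source


def snake(x, alpha=1.0):
    """Snake activation: x + sin^2(alpha * x) / alpha."""
    return x + math.sin(alpha * x) ** 2 / alpha


# Half a second of a 100 Hz voiced segment followed by an unvoiced segment.
sample_rate = 8000
f0_track = [100.0] * 2000 + [0.0] * 2000
excitation = harmonic_plus_noise_source(f0_track, sample_rate)
activated = [snake(v, alpha=2.0) for v in excitation]
```

In a real vocoder the F0 track would come from an F0 estimator and be upsampled to the audio rate, and `alpha` would typically be a learned per-channel parameter. Both are fixed by hand here to keep the sketch self-contained.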
### Experimental Results:
- On the LJSpeech dataset, HiFTNet achieved a CMOS score of -0.06 (p≫0.05), comparable to real audio.
- On the LibriTTS dataset, HiFTNet achieved a CMOS score of 0.21 (p<0.05) on the test-clean subset, significantly outperforming BigVGAN-base, with an inference speed 2.5 times faster and the same GPU memory usage.
- On the test-other subset, HiFTNet achieved a CMOS score of -0.05 (p≫0.05), comparable to BigVGAN, while running 4 times faster with only 1/6 of BigVGAN's parameters.
### Conclusion:
By combining a harmonic-plus-noise source filter in the time-frequency domain with iSTFT-based waveform synthesis, HiFTNet improves inference speed and reduces the number of parameters while maintaining high quality, setting a new benchmark for efficient neural vocoding in real-time speech synthesis applications.