Abstract: Recent advancements in speech synthesis have leveraged GAN-based networks like HiFi-GAN and BigVGAN to produce high-fidelity waveforms from mel-spectrograms. However, these networks are computationally expensive and parameter-heavy. iSTFTNet addresses these limitations by integrating inverse short-time Fourier transform (iSTFT) into the network, achieving both speed and parameter efficiency. In this paper, we introduce an extension to iSTFTNet, termed HiFTNet, which incorporates a harmonic-plus-noise source filter in the time-frequency domain that uses a sinusoidal source from the fundamental frequency (F0) inferred via a pre-trained F0 estimation network for fast inference speed. Subjective evaluations on LJSpeech show that our model significantly outperforms both iSTFTNet and HiFi-GAN, achieving ground-truth-level performance. HiFTNet also outperforms BigVGAN-base on LibriTTS for unseen speakers and achieves comparable performance to BigVGAN while being four times faster with only $1/6$ of the parameters. Our work sets a new benchmark for efficient, high-quality neural vocoding, paving the way for real-time applications that demand high-quality speech synthesis.
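To make the "sinusoidal source from the fundamental frequency" concrete, here is a minimal time-domain sketch of a harmonic-plus-noise excitation built from a per-sample F0 contour. It is not HiFTNet's implementation: the function name `harmonic_plus_noise_source`, the number of harmonics, the amplitudes, and the noise levels are all illustrative assumptions, and the paper's model applies its source filter in the time-frequency domain with F0 coming from a pre-trained estimator rather than being supplied by hand.

```python
import math
import random


def harmonic_plus_noise_source(f0, sample_rate=22050, num_harmonics=4,
                               sine_amp=0.1, noise_std=0.003, seed=0):
    """Build an excitation signal from a per-sample F0 contour in Hz.

    Voiced samples (f0 > 0) get a sum of harmonics plus weak Gaussian noise.
    Unvoiced samples (f0 == 0) get Gaussian noise only.
    All constants are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    phase = 0.0  # running phase of the fundamental, in radians
    source = []
    for f in f0:
        # Integrate the instantaneous frequency to obtain the phase.
        phase = (phase + 2.0 * math.pi * f / sample_rate) % (2.0 * math.pi)
        if f > 0:
            # Sum integer multiples of the fundamental, skipping any
            # harmonic at or above the Nyquist frequency to avoid aliasing.
            harmonics = sum(
                math.sin(k * phase)
                for k in range(1, num_harmonics + 1)
                if k * f < sample_rate / 2
            )
            sample = sine_amp * harmonics + rng.gauss(0.0, noise_std)
        else:
            # Unvoiced region: noise only (the scale is an assumption).
            sample = rng.gauss(0.0, sine_amp / 3.0)
        source.append(sample)
    return source
```

Calling `harmonic_plus_noise_source([220.0] * 1000 + [0.0] * 1000)` yields 1000 voiced samples followed by 1000 noise-only samples. In a vocoder, a signal of this kind would serve as the excitation that the network then shapes into speech.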

What problem does this paper attempt to address?

This paper addresses the problem of efficient, high-quality waveform generation in speech synthesis. Existing GAN-based networks such as HiFi-GAN and BigVGAN generate high-fidelity waveforms but are computationally expensive and parameter-heavy, which makes them poorly suited to real-time applications. To solve this, the paper proposes HiFTNet, a neural vocoder that extends iSTFTNet with a harmonic-plus-noise source filter in the time-frequency domain and combines it with the inverse short-time Fourier transform (iSTFT) to achieve fast, parameter-efficient, high-quality waveform generation.

### Main Contributions:

1. **Efficiency**: HiFTNet significantly improves inference speed and reduces the number of parameters while maintaining high quality.
2. **High Quality**: Subjective evaluations show that HiFTNet performs comparably to real audio on the LJSpeech dataset and achieves performance comparable to BigVGAN on the LibriTTS dataset.
3. **Innovative Techniques**: A harmonic-plus-noise source filter (hn-NSF) in the time-frequency domain, together with additional techniques such as the Snake activation function, improves the model's performance.

### Experimental Results:

- On the LJSpeech dataset, HiFTNet achieved a CMOS score of -0.06 (p≫0.05), comparable to real audio.
- On the LibriTTS dataset, HiFTNet achieved a CMOS score of 0.21 (p<0.05) on the test-clean subset, significantly outperforming BigVGAN-base, with an inference speed 2.5 times faster and the same GPU memory usage.
- On the test-other subset, HiFTNet achieved a CMOS score of -0.05 (p≫0.05), comparable to BigVGAN, but with an inference speed 4 times faster and only 1/6 of BigVGAN's parameters.

### Conclusion:

By introducing these techniques, HiFTNet improves inference speed and reduces the number of parameters while maintaining high quality, setting a new benchmark for real-time speech synthesis applications.
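The role of the iSTFT in this family of vocoders can be illustrated with a small round trip: if a network produces per-frame magnitude and phase, the waveform follows from an inverse DFT per frame plus windowed overlap-add. The sketch below is a plain-Python illustration under stated assumptions (a full two-sided spectrum, a periodic Hann window, and toy sizes `n_fft=16`, `hop=4`); the helper names are hypothetical and nothing here is taken from the HiFTNet or iSTFTNet code.

```python
import cmath
import math


def hann(n_fft):
    """Periodic Hann window of length n_fft."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_fft) for n in range(n_fft)]


def stft(signal, n_fft=16, hop=4):
    """Naive STFT returning one full (two-sided) complex spectrum per frame."""
    window = hann(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft)
                for n in range(n_fft))
            for k in range(n_fft)
        ])
    return frames


def istft(magnitude, phase, n_fft=16, hop=4):
    """Rebuild a waveform from per-frame magnitude and phase.

    Each frame is inverted with a naive inverse DFT, windowed again,
    overlap-added, and normalized by the summed squared window.
    """
    window = hann(n_fft)
    length = n_fft + hop * (len(magnitude) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for t, (mag_t, ph_t) in enumerate(zip(magnitude, phase)):
        # Combine magnitude and phase into a complex spectrum.
        spec = [m * cmath.exp(1j * p) for m, p in zip(mag_t, ph_t)]
        for n in range(n_fft):
            x = sum(spec[k] * cmath.exp(2j * math.pi * k * n / n_fft)
                    for k in range(n_fft)).real / n_fft
            out[t * hop + n] += x * window[n]
            norm[t * hop + n] += window[n] ** 2
    # Samples with no window coverage (e.g. the very first one) are set to 0.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]
```

Feeding `istft` the magnitude and phase of `stft(x)` recovers `x` in the fully overlapped interior of the signal. Only the edge samples, where the window coverage is incomplete, can differ. A practical implementation would use an FFT and a one-sided spectrum instead of these naive loops.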

HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

Fast Neural Speech Waveform Generative Models With Fully-Connected Layer-Based Upsampling

iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

FastFit: Towards Real-Time Iterative Neural Vocoder by Replacing U-Net Encoder With Multiple STFTs

HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

DeepGAN: A Fast and High-Quality Time-Domain-based Neural Vocoder for Low-Resource Scenarios

A Neural Vocoder with Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis

FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter

InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself

APNet2: High-quality and High-efficiency Neural Vocoder with Direct Prediction of Amplitude and Phase Spectra

APNet: An All-Frame-Level Neural Vocoder Incorporating Direct Prediction of Amplitude and Phase Spectra

Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis

MusicHiFi: Fast High-Fidelity Stereo Vocoding

HiFi++: a Unified Framework for Bandwidth Extension and Speech Enhancement

Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

FastSpeech: Fast, Robust and Controllable Text to Speech

WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis

FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis