Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation

Reo Yoneyama, Atsushi Miyashita, Ryuichi Yamamoto, Tomoki Toda
2024-11-11
Abstract: Neural vocoders often struggle with aliasing in latent feature spaces, caused by time-domain nonlinear operations and resampling layers. Aliasing folds high-frequency components into the low-frequency range, making aliased and original frequency components indistinguishable and introducing two practical issues. First, aliasing complicates the waveform generation process, as the subsequent layers must address these aliasing effects, increasing the computational complexity. Second, it limits extrapolation performance, particularly in handling high fundamental frequencies, which degrades the perceptual quality of generated speech waveforms. This paper demonstrates that 1) time-domain nonlinear operations inevitably introduce aliasing but provide a strong inductive bias for harmonic generation, and 2) time-frequency-domain processing can achieve aliasing-free waveform synthesis but lacks the inductive bias for effective harmonic generation. Building on this insight, we propose Wavehax, an aliasing-free neural WAVEform generator that integrates 2D convolution and a HArmonic prior for reliable Complex Spectrogram estimation. Experimental results show that Wavehax achieves speech quality comparable to existing high-fidelity neural vocoders and exhibits exceptional robustness in scenarios requiring high fundamental frequency extrapolation, where aliasing effects become typically severe. Moreover, Wavehax requires less than 5% of the multiply-accumulate operations and model parameters compared to HiFi-GAN V1, while achieving over four times faster CPU inference speed.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the aliasing that arises in neural vocoders from time-domain nonlinear operations and resampling layers. Aliasing causes two practical problems:

1. **A more complicated waveform generation process**: Subsequent layers must compensate for the aliasing effects, which increases computational complexity.
2. **Limited extrapolation performance**: Aliasing degrades the perceptual quality of generated speech, especially at high fundamental frequencies (F0).

To solve these problems, the authors propose Wavehax, an aliasing-free neural waveform generator that combines 2D convolution with a harmonic prior for reliable complex spectrogram estimation. Its main contributions are:

- **Aliasing-free waveform synthesis**: Processing in the time-frequency domain with 2D convolution avoids the aliasing introduced by time-domain nonlinear operations.
- **Reliable complex spectrogram estimation**: Time-frequency-domain processing alone lacks the inductive bias for harmonic generation, so the harmonic prior is used to supply it and make complex spectrogram estimation more reliable.
- **Strong quality and efficiency**: Experimental results show that Wavehax matches existing high-fidelity neural vocoders in speech quality and is markedly more robust in high-F0 extrapolation scenarios. It also requires less than 5% of the multiply-accumulate operations and model parameters of HiFi-GAN V1, with over four times faster CPU inference.

In summary, the paper uses Wavehax to remove aliasing from neural waveform synthesis and thereby improve both the quality and the efficiency of waveform generation.
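
The folding effect described in the abstract can be illustrated with a small numerical sketch. It is not taken from the paper: the 24 kHz sampling rate, the 5 kHz tone, the cubic nonlinearity, and the helper names `folded_frequency` and `spectral_peaks` are all assumptions chosen for illustration. Cubing a sinusoid creates a third harmonic at 15 kHz, which lies above the 12 kHz Nyquist limit and therefore reappears at 9 kHz, a frequency that is not a harmonic of the original tone.

```python
import numpy as np


def folded_frequency(freq_hz, sample_rate_hz):
    """Frequency at which a real tone of freq_hz appears after sampling."""
    f = freq_hz % sample_rate_hz
    return min(f, sample_rate_hz - f)


def spectral_peaks(signal, sample_rate_hz, threshold=0.01):
    """Return the frequencies (Hz) whose normalized magnitude exceeds threshold."""
    n = len(signal)
    # Scale so that a unit-amplitude sinusoid gives a magnitude of 1.
    magnitude = np.abs(np.fft.rfft(signal)) / (n / 2)
    bins = np.flatnonzero(magnitude > threshold)
    return [float(k * sample_rate_hz / n) for k in bins]


# Illustrative values only (assumptions, not settings from the paper).
sample_rate_hz = 24000
f0_hz = 5000
n_samples = 24000  # one second, so FFT bins are spaced exactly 1 Hz apart

t = np.arange(n_samples) / sample_rate_hz
x = np.sin(2 * np.pi * f0_hz * t)

# Time-domain nonlinearity: sin^3 contains components at f0 and 3 * f0.
y = x ** 3

peaks = spectral_peaks(y, sample_rate_hz)
alias_hz = folded_frequency(3 * f0_hz, sample_rate_hz)

print("spectral peaks (Hz):", peaks)
print("3 * f0 folds to (Hz):", alias_hz)
```

The nonlinearity does generate a harmonic, which matches the inductive bias the abstract attributes to time-domain operations, but at this sampling rate that harmonic is indistinguishable from a genuine 9 kHz component. The effect grows as the fundamental frequency rises, since more harmonics cross the Nyquist limit.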