Abstract: Neural vocoders often struggle with aliasing in latent feature spaces, caused by time-domain nonlinear operations and resampling layers. Aliasing folds high-frequency components into the low-frequency range, making aliased and original frequency components indistinguishable and introducing two practical issues. First, aliasing complicates the waveform generation process, as the subsequent layers must address these aliasing effects, increasing the computational complexity. Second, it limits extrapolation performance, particularly in handling high fundamental frequencies, which degrades the perceptual quality of generated speech waveforms. This paper demonstrates that 1) time-domain nonlinear operations inevitably introduce aliasing but provide a strong inductive bias for harmonic generation, and 2) time-frequency-domain processing can achieve aliasing-free waveform synthesis but lacks the inductive bias for effective harmonic generation. Building on this insight, we propose Wavehax, an aliasing-free neural WAVEform generator that integrates 2D convolution and a HArmonic prior for reliable Complex Spectrogram estimation. Experimental results show that Wavehax achieves speech quality comparable to existing high-fidelity neural vocoders and exhibits exceptional robustness in scenarios requiring high fundamental frequency extrapolation, where aliasing effects typically become severe. Moreover, Wavehax requires less than 5% of the multiply-accumulate operations and model parameters compared to HiFi-GAN V1, while achieving over four times faster CPU inference speed.
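The folding effect described in the abstract can be reproduced in a few lines. The sketch below is illustrative only and is not taken from the paper: it assumes a cubic nonlinearity as a stand-in for a network activation, and the function names, the 3 kHz tone, and the two sample rates are choices made for this example. Cubing a sine creates a third harmonic; at a 16 kHz sample rate that 9 kHz harmonic lies above the 8 kHz Nyquist limit and folds back to 7 kHz, while at 48 kHz it stays where it belongs.

```python
import numpy as np


def spectral_peaks(signal, sample_rate, rel_threshold=0.05):
    """Return the frequencies (Hz) whose magnitude exceeds a fraction of the maximum."""
    magnitude = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    strong = magnitude > rel_threshold * magnitude.max()
    return [int(round(f)) for f in freqs[strong]]


def cubed_sine_peaks(f0, sample_rate, duration=1.0):
    """Apply a cubic nonlinearity to a sine at f0 and list the resulting spectral peaks."""
    n = int(sample_rate * duration)
    t = np.arange(n) / sample_rate
    x = np.sin(2.0 * np.pi * f0 * t)
    # sin^3(x) = (3 sin(x) - sin(3x)) / 4, so a component at 3 * f0 appears.
    return spectral_peaks(x ** 3, sample_rate)


# Nyquist = 8 kHz: the 9 kHz harmonic cannot be represented and folds to 7 kHz.
aliased = cubed_sine_peaks(3000, 16000)
# Nyquist = 24 kHz: the 9 kHz harmonic is represented correctly.
clean = cubed_sine_peaks(3000, 48000)

print("16 kHz sampling:", aliased)
print("48 kHz sampling:", clean)
```

Once the 9 kHz component has folded to 7 kHz, nothing in the sampled signal distinguishes it from a genuine 7 kHz component, which is why later layers cannot simply filter it out.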

What problem does this paper attempt to address?

This paper addresses aliasing in neural vocoders, which is caused by time-domain nonlinear operations and resampling layers. Aliasing creates two practical problems:

1. **Complicated waveform generation**: Subsequent layers must compensate for the aliasing effects, which increases computational complexity.
2. **Limited extrapolation performance**: Aliasing degrades the perceptual quality of the generated speech waveforms, particularly at high fundamental frequencies (F0).

To solve these problems, the authors propose Wavehax, an aliasing-free neural waveform generator that combines 2D convolution with a harmonic prior for reliable complex spectrogram estimation. The main contributions of Wavehax are:

- **Aliasing-free waveform synthesis**: Processing in the time-frequency domain with 2D convolution avoids the aliasing introduced by time-domain nonlinear operations.
- **Efficient and robust complex spectrogram estimation**: The harmonic prior strengthens the model's ability to generate harmonic components, improving the accuracy and robustness of complex spectrogram estimation.
- **High quality at low cost**: Experimental results show that Wavehax matches existing high-fidelity neural vocoders in speech quality and is especially robust in high-F0 extrapolation scenarios. Compared with HiFi-GAN V1, it requires less than 5% of the multiply-accumulate operations and model parameters and runs CPU inference more than four times faster.

In summary, Wavehax is proposed to eliminate aliasing in neural vocoders and thereby improve both the quality and the efficiency of waveform generation.
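This summary does not spell out how the harmonic prior is built, so the following is only a hypothetical sketch of one plausible reading: synthesize a sine wave that follows the frame-level F0 contour, then take its short-time Fourier transform to obtain a complex spectrogram that a 2D convolutional network could refine. The function name `harmonic_prior`, the 24 kHz sample rate, the hop size, the FFT size, and the Hann window are all assumptions for illustration, not values confirmed by the source.

```python
import numpy as np


def harmonic_prior(f0_frames, sample_rate=24000, hop=240, n_fft=960):
    """Hypothetical harmonic prior: complex STFT of an F0-driven sine wave.

    f0_frames holds one F0 value (Hz) per frame; 0 marks an unvoiced frame.
    Returns an array of shape (n_fft // 2 + 1, number_of_frames).
    """
    # Hold each frame-level F0 value for `hop` samples.
    f0 = np.repeat(np.asarray(f0_frames, dtype=float), hop)
    # Integrate instantaneous frequency to obtain the phase.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    # Silence the unvoiced regions (assumption: the prior is zero there).
    signal = np.where(f0 > 0, np.sin(phase), 0.0)

    # Centered STFT with a Hann window.
    window = np.hanning(n_fft)
    padded = np.pad(signal, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop
    frames = np.stack(
        [padded[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T


# Example: 50 voiced frames at a constant 300 Hz.
spec = harmonic_prior([300.0] * 50)
peak_bin = int(np.argmax(np.abs(spec[:, 25])))
print(spec.shape, peak_bin)
```

With these assumed settings each frequency bin spans 25 Hz, so the energy of the 300 Hz tone concentrates around bin 12. The point of such a prior would be to give the network the harmonic structure up front, since the abstract notes that time-frequency-domain processing alone lacks the inductive bias for harmonic generation.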

Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation

A Neural Vocoder with Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis

NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation

WaveCycleGAN: Synthetic-to-natural speech waveform conversion using cycle-consistent adversarial networks

WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis

HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Parallel WaveNet: Fast High-Fidelity Speech Synthesis

Mathematical Vocoder Algorithm: Modified Spectral Inversion for Efficient Neural Speech Synthesis

DNN-based Spectral Enhancement for Neural Waveform Generators with Low-bit Quantization

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Neural source-filter waveform models for statistical parametric speech synthesis

Fast Neural Speech Waveform Generative Models With Fully-Connected Layer-Based Upsampling

FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder

Puffin: pitch-synchronous neural waveform generation for fullband speech on modest devices

Hierarchical RNNs for Waveform-Level Speech Synthesis

Wave-U-Net Discriminator: Fast and Lightweight Discriminator for Generative Adversarial Network-Based Speech Synthesis

A Fast High-Fidelity Source-Filter Vocoder with Lightweight Neural Modules

Avocodo: Generative Adversarial Network for Artifact-free Vocoder

Neural Harmonic-plus-Noise Waveform Model with Trainable Maximum Voice Frequency for Text-to-Speech Synthesis

Waveform generation for text-to-speech synthesis using pitch-synchronous multi-scale generative adversarial networks