Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Hubert Siuzdak

2024-05-29

Abstract: Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefitting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.

Sound, Machine Learning, Audio and Speech Processing

What problem does this paper attempt to address?

The paper aims to close the gap between time-domain and Fourier-based neural vocoders in audio synthesis. Specifically, it targets three issues:

1. **Direct reconstruction of complex-valued spectrograms**: Earlier methods have struggled to reconstruct complex-valued spectrograms directly, mainly because of phase recovery issues.
2. **Improving computational efficiency**: Time-domain methods rely on upsampling operations that are redundant and computationally intensive.
3. **Matching state-of-the-art audio quality**: The paper proposes a new model, Vocos, which generates Fourier spectral coefficients and reaches audio quality comparable to the state of the art while running about an order of magnitude faster than prevailing time-domain vocoders.

Vocos achieves these goals through the following design choices:

- **Generating Fourier spectral coefficients**: Vocos predicts Fourier spectral coefficients instead of time-domain waveform samples.
- **Phase estimation**: The paper proposes a simple activation function for estimating phase angles, which naturally handles the phase wrapping problem.
- **Network architecture design**: Vocos employs ConvNeXt blocks and keeps the same temporal resolution of features throughout the network, avoiding the issues that transposed convolutions introduce in time-domain models.
- **Inverse Fast Fourier Transform (IFFT)**: Upsampling to the waveform is done by the IFFT rather than by learned layers, which accounts for much of the gain in computational efficiency.

Overall, Vocos matches state-of-the-art audio quality while substantially improving computational efficiency.
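
The phase estimation and IFFT upsampling steps described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `spectral_head` and `istft`, the Hann window, and the FFT and hop sizes are assumptions chosen for illustration, and random arrays stand in for the network outputs. The sketch assumes the network emits two real-valued maps per frame, a log-magnitude and an unconstrained phase value, and that the phase is passed through cosine and sine so that any real number lands on the unit circle.

```python
import numpy as np

def spectral_head(log_mag, phase_raw):
    """Map two real-valued outputs to complex STFT coefficients (illustrative)."""
    mag = np.exp(log_mag)  # exponential keeps the magnitude positive
    # cos and sin are 2*pi-periodic, so phase_raw needs no explicit wrapping
    return mag * (np.cos(phase_raw) + 1j * np.sin(phase_raw))

def istft(spec, n_fft, hop):
    """Inverse STFT: inverse real FFT per frame, then windowed overlap-add.

    spec has shape (n_fft // 2 + 1, n_frames).
    """
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window (assumed)
    frames = np.fft.irfft(spec, n=n_fft, axis=0) * window[:, None]
    n_frames = spec.shape[1]
    out_len = n_fft + hop * (n_frames - 1)
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    for t in range(n_frames):
        start = t * hop
        out[start:start + n_fft] += frames[:, t]
        norm[start:start + n_fft] += window ** 2
    # Divide by the summed squared window to undo the windowing
    return out / np.maximum(norm, 1e-8)

# Random stand-ins for network outputs at frame rate (10 frames)
rng = np.random.default_rng(0)
n_fft, hop, n_frames = 1024, 256, 10
log_mag = rng.normal(size=(n_fft // 2 + 1, n_frames))
phase_raw = rng.normal(scale=10.0, size=(n_fft // 2 + 1, n_frames))

audio = istft(spectral_head(log_mag, phase_raw), n_fft, hop)
print(audio.shape)  # (3328,)
```

In this sketch all computation before `istft` happens at the frame rate (10 frames), and the single inverse transform expands it to 3328 samples. No learned upsampling layer is involved, which is the efficiency argument the summary makes.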

FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter

A Fast High-Fidelity Source-Filter Vocoder with Lightweight Neural Modules

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

MusicHiFi: Fast High-Fidelity Stereo Vocoding

Analysis by Adversarial Synthesis -- A Novel Approach for Speech Vocoding

Mathematical Vocoder Algorithm : Modified Spectral Inversion for Efficient Neural Speech Synthesis

WOLONet: Wave Outlooker for Efficient and High Fidelity Speech Synthesis

NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation

Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis

Avocodo: Generative Adversarial Network for Artifact-free Vocoder

Parallel Synthesis for Autoregressive Speech Generation

VQCPC-GAN: Variable-Length Adversarial Audio Synthesis Using Vector-Quantized Contrastive Predictive Coding

VNet: A GAN-based Multi-Tier Discriminator Network for Speech Synthesis Vocoders

DeepGAN: A Fast and High-Quality Time-Domain-based Neural Vocoder for Low-Resource Scenarios

Neural Homomorphic Vocoder

BigVSAN: Enhancing GAN-based Neural Vocoders with Slicing Adversarial Network

Wavehax: Aliasing-Free Neural Waveform Synthesis Based on 2D Convolution and Harmonic Prior for Reliable Complex Spectrogram Estimation

FA-GAN: Artifacts-free and Phase-aware High-fidelity GAN-based Vocoder

A Synthetic Corpus Generation Method for Neural Vocoder Training

Puffin: pitch-synchronous neural waveform generation for fullband speech on modest devices