Ultra-lightweight Neural Differential DSP Vocoder For High Quality Speech Synthesis

Prabhav Agrawal, Thilo Koehler, Zhiping Xiu, Prashant Serai, Qing He

2024-01-19

Abstract: Neural vocoders model the raw audio waveform and synthesize high-quality audio, but even the highly efficient ones, like MB-MelGAN and LPCNet, fail to run in real time on a low-end device like a smartglass. A pure digital signal processing (DSP) based vocoder can be implemented via lightweight fast Fourier transforms (FFT), and is therefore an order of magnitude faster than any neural vocoder. A DSP vocoder often yields lower audio quality because it consumes over-smoothed acoustic model predictions of approximate representations for the vocal tract. In this paper, we propose an ultra-lightweight differential DSP (DDSP) vocoder that uses a jointly optimized acoustic model with a DSP vocoder, and learns without an extracted spectral feature for the vocal tract. The model achieves audio quality comparable to neural vocoders with a high average MOS of 4.36 while being as efficient as a DSP vocoder. Our C++ implementation, without any hardware-specific optimization, is at 15 MFLOPS, surpasses MB-MelGAN by 340 times in terms of FLOPS, and achieves a vocoder-only RTF of 0.003 and overall RTF of 0.044 while running single-threaded on a 2GHz Intel Xeon CPU.

Sound, Machine Learning, Audio and Speech Processing

What problem does this paper attempt to address?

This paper addresses real-time, on-device efficiency in high-quality speech synthesis. Current neural vocoders can generate high-fidelity audio, but even efficient ones such as MB-MelGAN and LPCNet cannot run in real time on low-power devices such as smart glasses. The paper presents an ultra-lightweight Differentiable Digital Signal Processing (DDSP) vocoder that jointly optimizes a neural acoustic model with a traditional DSP vocoder based on the Fast Fourier Transform (FFT). The result is as efficient as a DSP vocoder while reaching audio quality comparable to neural vocoders, with an average MOS of 4.36. The authors' C++ implementation, without hardware-specific optimization, requires 15 MFLOPS, which is 340 times fewer FLOPS than MB-MelGAN. Running single-threaded on a 2GHz Intel Xeon CPU, it achieves a real-time factor (RTF) of 0.003 for the vocoder alone and 0.044 overall. The key to the approach is that, although the DSP vocoder has no learnable parameters, the entire module is end-to-end differentiable, so the acoustic model can learn from the amplitude spectra of real audio instead of from an extracted spectral feature for the vocal tract. In summary, the paper aims to achieve real-time, high-quality speech synthesis on resource-constrained devices, balancing audio quality against computational cost through the DDSP design.
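To make the reported figures concrete, the short sketch below works through the arithmetic. It uses the standard definition of the real-time factor (processing time divided by audio duration) and only the numbers quoted above. The 10-second utterance length, the variable names, and the reading of "340 times" as a simple FLOPS ratio are assumptions made for illustration; the acoustic-model share is derived by assuming the overall RTF is the sum of the acoustic-model and vocoder RTFs, which the summary does not state explicitly.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    return processing_seconds / audio_seconds


# Figures quoted in the abstract.
VOCODER_RTF = 0.003   # vocoder only
OVERALL_RTF = 0.044   # acoustic model + vocoder
DDSP_MFLOPS = 15.0    # reported cost of the DDSP vocoder
FLOPS_RATIO = 340.0   # reported advantage over MB-MelGAN

# Hypothetical utterance length, chosen only for illustration.
audio_seconds = 10.0

# Compute time implied by each RTF for that utterance, in milliseconds.
vocoder_ms = VOCODER_RTF * audio_seconds * 1000.0
overall_ms = OVERALL_RTF * audio_seconds * 1000.0

# Assumption: overall RTF = acoustic-model RTF + vocoder RTF.
acoustic_model_rtf = OVERALL_RTF - VOCODER_RTF

# MB-MelGAN cost implied by the reported ratio (not a measured value).
implied_mb_melgan_gflops = DDSP_MFLOPS * FLOPS_RATIO / 1000.0

# How many times faster than real time each stage runs.
vocoder_speedup = 1.0 / VOCODER_RTF
overall_speedup = 1.0 / OVERALL_RTF

print(f"vocoder: {vocoder_ms:.0f} ms for {audio_seconds:.0f} s of audio")
print(f"overall: {overall_ms:.0f} ms for {audio_seconds:.0f} s of audio")
print(f"acoustic-model RTF (assumed additive): {acoustic_model_rtf:.3f}")
print(f"implied MB-MelGAN cost: {implied_mb_melgan_gflops:.1f} GFLOPS")
print(f"speed vs. real time: vocoder {vocoder_speedup:.0f}x, overall {overall_speedup:.0f}x")
```

Under these assumptions the vocoder accounts for only a small fraction of the overall synthesis time, so most of the remaining cost sits in the acoustic model rather than in waveform generation.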

Fast, High-Quality and Parameter-Efficient Articulatory Synthesis using Differentiable DSP

NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation

Puffin: pitch-synchronous neural waveform generation for fullband speech on modest devices

A Fast High-Fidelity Source-Filter Vocoder with Lightweight Neural Modules

SiD-WaveFlow: A Low-Resource Vocoder Independent of Prior Knowledge

DSPGAN: a GAN-based universal vocoder for high-fidelity TTS by time-frequency domain supervision from DSP

Deep Vocoder: Low Bit Rate Compression of Speech with Deep Autoencoder

FreeV: Free Lunch For Vocoders Through Pseudo Inversed Mel Filter

SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis

Mathematical Vocoder Algorithm : Modified Spectral Inversion for Efficient Neural Speech Synthesis

Neural Homomorphic Vocoder

FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge

Realtime robust speech communication based on iterative joint source-channel decoding and demodulation algorithm for MELP vocoder

A Neural Denoising Vocoder for Clean Waveform Generation from Noisy Mel-Spectrogram based on Amplitude and Phase Predictions

LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders

A Streamwise GAN Vocoder for Wideband Speech Coding at Very Low Bit Rate

High quality, lightweight and adaptable TTS using LPCNet

Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge

Performance Comparison of Linear Prediction based Vocoders in Linux Platform

Study and development of MELP vocoder