HiFi-Glot: Neural Formant Synthesis with Differentiable Resonant Filters

Lauri Juvela, Pablo Pérez Zarazaga, Gustav Eje Henter, Zofia Malisz
2024-09-23
Abstract: We introduce an end-to-end neural speech synthesis system that uses the source-filter model of speech production. Specifically, we apply differentiable resonant filters to a glottal waveform generated by a neural vocoder. The aim is to obtain a controllable synthesiser, similar to classic formant synthesis, but with much higher perceptual quality - filling a research gap in current neural waveform generators and responding to hitherto unmet needs in the speech sciences. Our setup generates audio from a core set of phonetically meaningful speech parameters, with the filters providing direct control over formant frequency resonances in synthesis. Direct synthesis control is a key feature for reliable stimulus creation in important speech science experiments. We show that the proposed source-filter method gives better perceptual quality than the industry standard for formant manipulation (i.e., Praat), whilst being competitive in terms of formant frequency control accuracy.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the limited controllability of current neural speech synthesis systems. Although these systems have made remarkable progress in naturalness, they cannot precisely control acoustic parameters such as fundamental frequency and formant frequencies in the way classic formant synthesis can. This limits their use in speech science experiments, where reliable stimulus creation depends on direct synthesis control.

To fill this gap, the authors propose HiFi-Glot, an end-to-end neural speech synthesis system based on the source-filter model of speech production. It applies differentiable resonant filters to a glottal waveform generated by a neural vocoder, aiming for higher perceptual quality while keeping direct control over formant frequencies.

### Main contributions:
1. **Explicit formant frequency control**: Source-filter neural synthesis gives explicit control over formant frequencies.
2. **Differentiable all-pole filters**: The filters are differentiable with respect to filter parameters, excitation signals, and output signals, making them fully compatible with end-to-end training and enabling fast parallel computation on GPUs.

### Experimental verification:
To evaluate HiFi-Glot, the authors compared it with an existing neural formant synthesis (NFS) model and with Praat, the industry-standard tool for formant manipulation. The results show that HiFi-Glot gives better perceptual quality than Praat while remaining competitive in formant frequency control accuracy.

### Summary:
This research combines the high fidelity of neural speech synthesis with the controllability of the traditional source-filter model. It gives speech scientists a tool that offers direct control over formant frequencies at a higher perceptual quality than Praat-based manipulation.
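
The all-pole resonant filtering named under the main contributions can be illustrated with a small sketch. This is not the authors' implementation: it is plain Python with no automatic differentiation, and the function names (`resonator_coeffs`, `allpole_denominator`, `allpole_filter`, `magnitude_response`) and the formant values are hypothetical. It assumes the textbook mapping from a formant's centre frequency and bandwidth to a conjugate pole pair, and cascades one two-pole resonator per formant. Every step uses only elementary arithmetic, so in an autodiff framework the same computation would be differentiable with respect to the formant parameters and the excitation.

```python
import cmath
import math


def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Denominator [1, a1, a2] of a two-pole resonator for one formant."""
    radius = math.exp(-math.pi * bw_hz / sample_rate)  # pole radius from bandwidth
    theta = 2.0 * math.pi * freq_hz / sample_rate      # pole angle from centre frequency
    return [1.0, -2.0 * radius * math.cos(theta), radius * radius]


def polymul(a, b):
    """Multiply two polynomials given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def allpole_denominator(formants, sample_rate):
    """Cascade resonators by multiplying their denominators into one A(z)."""
    a = [1.0]
    for freq_hz, bw_hz in formants:
        a = polymul(a, resonator_coeffs(freq_hz, bw_hz, sample_rate))
    return a


def allpole_filter(excitation, a):
    """Apply 1/A(z): y[n] = x[n] - sum_{k>=1} a[k] * y[n-k]."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def magnitude_response(a, freq_hz, sample_rate):
    """|1 / A(e^{jw})| evaluated at a single frequency."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    denom = sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a))
    return 1.0 / abs(denom)


# Illustrative (assumed) formant frequencies and bandwidths in Hz.
sample_rate = 16000
formants = [(700.0, 80.0), (1200.0, 90.0), (2600.0, 120.0)]
a = allpole_denominator(formants, sample_rate)

# A unit impulse stands in for the glottal excitation that a neural
# vocoder would generate in the system described above.
impulse = [1.0] + [0.0] * 255
response = allpole_filter(impulse, a)
```

Changing a formant frequency in `formants` moves the corresponding spectral peak, which is the kind of direct control the summary describes. The sample-by-sample recursion in `allpole_filter` is sequential, so a GPU implementation would need a parallel formulation; evaluating the response on a frequency grid, as `magnitude_response` does for one frequency, is one possible route, offered here as an assumption rather than as the paper's approach.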