Abstract: We introduce an end-to-end neural speech synthesis system that uses the source-filter model of speech production. Specifically, we apply differentiable resonant filters to a glottal waveform generated by a neural vocoder. The aim is to obtain a controllable synthesiser, similar to classic formant synthesis, but with much higher perceptual quality - filling a research gap in current neural waveform generators and responding to hitherto unmet needs in the speech sciences. Our setup generates audio from a core set of phonetically meaningful speech parameters, with the filters providing direct control over formant frequency resonances in synthesis. Direct synthesis control is a key feature for reliable stimulus creation in important speech science experiments. We show that the proposed source-filter method gives better perceptual quality than the industry standard for formant manipulation (i.e., Praat), whilst being competitive in terms of formant frequency control accuracy.

What problem does this paper attempt to address?

This paper addresses the following problem: although current neural speech synthesis systems have made remarkable progress in naturalness, they offer poor controllability and cannot precisely control the acoustic parameters of speech (such as fundamental frequency and formant frequencies) in the way classic formant synthesis can. This lack of controllability limits the use of these advanced models in speech science experiments. The paper aims to fill this gap in existing neural waveform generators and to respond to unmet needs in speech science. The authors propose an end-to-end neural speech synthesis system, HiFi-Glot, which uses a source-filter model for speech generation. In particular, it applies differentiable resonant filters to the glottal waveform generated by a neural vocoder, obtaining higher perceptual quality while maintaining direct control of formant frequencies.

### Main contributions:

1. **Explicit formant frequency control**: Explicit control of formant frequencies is achieved through source-filter neural synthesis.
2. **Differentiable all-pole filters**: These filters are differentiable with respect to filter parameters, excitation signals, and output signals, making them fully compatible with end-to-end training and enabling fast parallel computation on GPUs.

### Experimental verification:

To evaluate the proposed HiFi-Glot method, the authors compared it with the existing neural formant synthesis model (NFS) and the industry-standard tool Praat. The results show that HiFi-Glot is superior to Praat in perceptual quality while remaining competitive in formant frequency control accuracy.
### Summary:

This research combines the high fidelity of neural speech synthesis with the controllability of the traditional source-filter model, providing speech scientists with a more powerful tool that can manipulate speech features more precisely, thus promoting the development of speech science research.
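To make the source-filter idea above more concrete, the following is a minimal sketch of formant control through a cascade of two-pole resonators, i.e. an all-pole filter. It is not the HiFi-Glot implementation: it is plain, non-differentiable Python that runs sample by sample, whereas the paper describes differentiable filters suited to GPU training. All function names, the 16 kHz sample rate, the formant and bandwidth values, and the impulse-train excitation are assumptions chosen for illustration. The only fact relied on is the textbook mapping from a formant frequency F and bandwidth B to a pole pair with radius exp(-πB/fs) and angle 2πF/fs.

```python
import cmath
import math

SAMPLE_RATE = 16000  # Hz, illustrative value (assumption)


def resonator_coeffs(formant_hz, bandwidth_hz, sample_rate=SAMPLE_RATE):
    """Denominator [1, a1, a2] of one two-pole resonator."""
    radius = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius from bandwidth
    angle = 2.0 * math.pi * formant_hz / sample_rate          # pole angle from frequency
    return [1.0, -2.0 * radius * math.cos(angle), radius * radius]


def cascade(sections):
    """Multiply the section denominators into a single all-pole polynomial."""
    poly = [1.0]
    for section in sections:
        product = [0.0] * (len(poly) + len(section) - 1)
        for i, p in enumerate(poly):
            for j, s in enumerate(section):
                product[i + j] += p * s
        poly = product
    return poly


def all_pole_filter(excitation, a):
    """Apply y[n] = x[n] - sum_{k>=1} a[k] * y[n-k] sample by sample."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k in range(1, min(len(a), n + 1)):
            acc -= a[k] * y[n - k]
        y.append(acc)
    return y


def magnitude_response(a, freq_hz, sample_rate=SAMPLE_RATE):
    """|1 / A(e^{jw})| evaluated at a single frequency."""
    w = 2.0 * math.pi * freq_hz / sample_rate
    return 1.0 / abs(sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(a)))


def spectral_peaks(a, lo_hz=50, hi_hz=4000, step_hz=5):
    """Grid frequencies where the magnitude response has a local maximum."""
    freqs = list(range(lo_hz, hi_hz + 1, step_hz))
    mags = [magnitude_response(a, f) for f in freqs]
    return [freqs[i] for i in range(1, len(freqs) - 1)
            if mags[i - 1] < mags[i] > mags[i + 1]]


# Hypothetical (formant frequency, bandwidth) pairs in Hz, roughly vowel-like.
formants = [(500, 80), (1500, 100), (2500, 120)]
a = cascade([resonator_coeffs(f, b) for f, b in formants])

# Crude stand-in for a glottal source: a 100 Hz impulse train, 0.1 s long.
excitation = [1.0 if n % 160 == 0 else 0.0 for n in range(1600)]
speech_like = all_pole_filter(excitation, a)

# Peaks of the filter response should sit close to the requested formants.
peaks = spectral_peaks(a)
```

Changing a value in `formants` moves the corresponding spectral peak, which is the kind of direct formant control the summary refers to. In a differentiable setting the same coefficients would typically be computed with tensor operations so that gradients can flow back to the formant parameters; that part is not shown here.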

HiFi-Glot: Neural Formant Synthesis with Differentiable Resonant Filters

Speaker-independent neural formant synthesis

Embedding a Differentiable Mel-cepstral Synthesis Filter to a Neural Speech Synthesis System

HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

Source-Filter-Based Generative Adversarial Neural Vocoder for High Fidelity Speech Synthesis

HiFTNet: A Fast High-Quality Neural Vocoder with Harmonic-plus-Noise Filter and Inverse Short Time Fourier Transform

NAS-FM: Neural Architecture Search for Tunable and Interpretable Sound Synthesis based on Frequency Modulation

Formant-Controlled HMM-Based Speech Synthesis.

GlotNet—A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis

A Neural Vocoder with Hierarchical Generation of Amplitude and Phase Spectra for Statistical Parametric Speech Synthesis

Puffin: pitch-synchronous neural waveform generation for fullband speech on modest devices

Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer

Neural source-filter waveform models for statistical parametric speech synthesis

Neural Homomorphic Vocoder.

Fine-Grained and Interpretable Neural Speech Editing

A Fast High-Fidelity Source-Filter Vocoder with Lightweight Neural Modules.

Speaking-Rate-Controllable HiFi-GAN Using Feature Interpolation

MusicHiFi: Fast High-Fidelity Stereo Vocoding

HiFi-WaveGAN: Generative Adversarial Network with Auxiliary Spectrogram-Phase Loss for High-Fidelity Singing Voice Generation

Singing Voice Synthesis Using Differentiable LPC and Glottal-Flow-Inspired Wavetables