Abstract: This paper proposes ESTVocoder, a novel excitation-spectral-transformed neural vocoder within the framework of source-filter theory. ESTVocoder transforms the amplitude and phase spectra of the excitation into the corresponding speech amplitude and phase spectra using a neural filter whose backbone consists of ConvNeXt v2 blocks. The speech waveform is then reconstructed through the inverse short-time Fourier transform (ISTFT). The excitation is constructed from the F0: for voiced segments it contains full harmonic information, while for unvoiced segments it is represented by noise. The excitation provides the filter with prior knowledge of the amplitude and phase patterns, which is expected to reduce the modeling difficulty compared to conventional neural vocoders. To ensure the fidelity of the synthesized speech, ESTVocoder is trained adversarially with multi-scale and multi-resolution discriminators. Analysis-synthesis and text-to-speech experiments both confirm that the proposed ESTVocoder outperforms or is comparable to baseline neural vocoders, e.g., HiFi-GAN, SiFi-GAN, and Vocos, in terms of synthesized speech quality, with reasonable model complexity and generation speed. Additional analysis experiments also demonstrate that the introduced excitation accelerates the model's convergence, thanks to the speech spectral prior information it contains.
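To make the excitation description concrete, the following is a minimal NumPy sketch of an F0-driven excitation and its amplitude and phase spectra. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_excitation` and `stft_amp_phase`, the sample rate, hop size, FFT size, noise level, and the choice of equal-amplitude sine harmonics are all hypothetical, since the abstract does not specify them.

```python
import numpy as np


def build_excitation(f0_frames, hop_size=256, sample_rate=22050,
                     noise_std=0.003, seed=0):
    """Sketch of an F0-driven excitation (all parameter values are assumptions).

    Voiced frames (F0 > 0) receive a sum of sine harmonics below Nyquist;
    unvoiced frames (F0 == 0) receive low-level Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    f0_frames = np.asarray(f0_frames, dtype=np.float64)
    # Hold each frame-level F0 value for hop_size samples.
    f0 = np.repeat(f0_frames, hop_size)
    voiced = f0 > 0
    # Running phase of the fundamental, in radians.
    phase = 2.0 * np.pi * np.cumsum(f0 / sample_rate)
    nyquist = sample_rate / 2.0
    excitation = np.zeros_like(f0)
    # The lowest voiced F0 determines the largest harmonic index needed.
    max_k = int(nyquist // f0_frames[f0_frames > 0].min()) if voiced.any() else 0
    for k in range(1, max_k + 1):
        # Add harmonic k only where it stays below the Nyquist frequency.
        mask = voiced & (k * f0 < nyquist)
        excitation[mask] += np.sin(k * phase[mask])
    # Unvoiced samples are represented by noise.
    excitation[~voiced] = rng.normal(0.0, noise_std, size=int((~voiced).sum()))
    return excitation


def stft_amp_phase(x, n_fft=1024, hop_size=256):
    """Amplitude and phase spectra from a Hann-windowed STFT (frames x bins)."""
    window = np.hanning(n_fft)
    x = np.pad(x, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(x) - n_fft) // hop_size
    frames = np.stack([x[i * hop_size:i * hop_size + n_fft] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    return np.abs(spec), np.angle(spec)


# Example: 10 unvoiced frames followed by 10 voiced frames at 200 Hz.
f0_track = [0.0] * 10 + [200.0] * 10
exc = build_excitation(f0_track)
amp, pha = stft_amp_phase(exc)
```

In a vocoder of the kind the abstract describes, `amp` and `pha` would be the inputs that the neural filter maps to speech amplitude and phase spectra before the ISTFT; the filter itself is not sketched here.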

An Excitation Model Based On Inverse Filtering For Speech Analysis And Synthesis

Pitch-Scaled Spectrum Based Excitation Model for HMM-based Speech Synthesis

Amplitude Spectrum Based Excitation Model For HMM-Based Speech Synthesis

Pitch-scaled Analysis Based Residual Reconstruction for Speech Analysis and Synthesis

Inverse Filtering Based Harmonic Plus Noise Excitation Model for HMM-Based Speech Synthesis

Sinusoidal excitation LPC vocoder

Speech Enhancement Based On Analysis Synthesis Framework With Improved Pitch Estimation And Spectral Envelope Enhancement

ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram

An Experimental Investigation on Excitation Representation of WaveNet-Based Neural Vocoders

Modulation Spectrum Compensation For HMM-Based Speech Synthesis Using Line Spectral Pairs

DCT_M Model for Excitation Parameter in Low Bit Rate Vocoder

Speech Enhancement Based on Analysis–Synthesis Framework with Improved Parameter Domain Enhancement

Neural text-to-speech with a modeling-by-generation excitation vocoder

Investigation of the Spectral Envelope Estimation Vocoder and Improved Pitch Estimation Based on the Sinusoidal Speech Model

The Statistic-Based Prediction of Excitation Spectral Parameters in Low Bit-rate Vocoder

Multi-source Based Acoustic Model for Speech Synthesis

New synthesis method based on LMA vocal tract model

An initial research: Towards accurate pitch extraction for speech synthesis based on BLSTM

Low-Complexity 3.6 Kb/s Speech Coding Algorithm Based on Sinusoidal Excitation

Msdtron: a high-capability multi-speaker speech synthesis system for diverse data using characteristic information

Leaping Frame Detection and Processing with a 2.4 Kb/s SELP Vocoder