Abstract: This paper proposes ESTVocoder, a novel excitation-spectral-transformed neural vocoder within the framework of source-filter theory. ESTVocoder transforms the amplitude and phase spectra of the excitation into the corresponding speech amplitude and phase spectra using a neural filter whose backbone consists of ConvNeXt v2 blocks. The speech waveform is then reconstructed through the inverse short-time Fourier transform (ISTFT). The excitation is constructed from the F0: for voiced segments it contains full harmonic information, while for unvoiced segments it is represented by noise. The excitation provides the filter with prior knowledge of the amplitude and phase patterns, which is expected to reduce the modeling difficulty compared to conventional neural vocoders. To ensure the fidelity of the synthesized speech, an adversarial training strategy with multi-scale and multi-resolution discriminators is applied to ESTVocoder. Analysis-synthesis and text-to-speech experiments both confirm that the proposed ESTVocoder matches or outperforms baseline neural vocoders, e.g., HiFi-GAN, SiFi-GAN, and Vocos, in terms of synthesized speech quality, with reasonable model complexity and generation speed. Additional analysis experiments also demonstrate that the introduced excitation accelerates the model's convergence, thanks to the speech spectral prior information it contains.
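As an illustration of the excitation described above, the following is a minimal sketch of how an F0-driven excitation and its amplitude and phase spectra might be computed. It is not the authors' implementation: the function names `build_excitation` and `stft_amp_phase`, the sampling rate, hop size, FFT size, noise level, and the equal-amplitude harmonic sum are all assumptions made for this example. Voiced frames (F0 > 0) receive a sum of harmonics below the Nyquist frequency, and unvoiced frames receive Gaussian noise.

```python
import numpy as np


def build_excitation(f0_frames, sr=16000, hop=80, noise_std=0.03, seed=0):
    """Hypothetical sketch: harmonics for voiced frames, noise for unvoiced.

    f0_frames holds one F0 value (Hz) per frame; 0 marks an unvoiced frame.
    All default values are assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    # Hold each frame-level F0 value for `hop` samples (sample-level F0).
    f0 = np.repeat(np.asarray(f0_frames, dtype=np.float64), hop)
    voiced = f0 > 0
    # Running phase of the fundamental; it stays constant while F0 is 0.
    phase = 2.0 * np.pi * np.cumsum(f0 / sr)

    exc = np.zeros_like(f0)
    counts = np.zeros_like(f0)
    n_harm = int((sr / 2) // f0[voiced].min()) if voiced.any() else 0
    for k in range(1, n_harm + 1):
        # Keep only harmonics strictly below the Nyquist frequency.
        mask = voiced & (k * f0 < sr / 2)
        exc += np.where(mask, np.sin(k * phase), 0.0)
        counts += mask
    # Normalise by the number of active harmonics at each sample.
    exc = exc / np.maximum(counts, 1.0)
    # Unvoiced samples are replaced by zero-mean Gaussian noise.
    exc[~voiced] = rng.normal(0.0, noise_std, size=int((~voiced).sum()))
    return exc


def stft_amp_phase(x, n_fft=512, hop=80):
    """Plain NumPy STFT returning (amplitude, phase), shape (frames, bins)."""
    win = np.hanning(n_fft)
    x = np.pad(x, n_fft // 2, mode="reflect")  # centre frames on hop multiples
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    return np.abs(spec), np.angle(spec)


if __name__ == "__main__":
    # 20 unvoiced frames, 60 voiced frames at 200 Hz, 20 unvoiced frames.
    f0 = [0.0] * 20 + [200.0] * 60 + [0.0] * 20
    excitation = build_excitation(f0)
    amplitude, phase = stft_amp_phase(excitation)
    print(excitation.shape, amplitude.shape, phase.shape)
```

In a vocoder of the kind described in the abstract, spectra such as `amplitude` and `phase` would be the input to the neural filter, whose predicted speech spectra are converted back to a waveform with the ISTFT. The filter itself is not sketched here.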

DCT_M Model for Excitation Parameter in Low Bit Rate Vocoder

The Statistic-Based Prediction of Excitation Spectral Parameters in Low Bit-rate Vocoder

Pitch-Scaled Spectrum Based Excitation Model for HMM-based Speech Synthesis

An Excitation Model Based On Inverse Filtering For Speech Analysis And Synthesis

High Efficient Quantization of the Energy Parameter in 0.6 Kb/s Vocoders

ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram

An Experimental Investigation on Excitation Representation of WaveNet-Based Neural Vocoders

In-band Tone Signal Coding in the Low-Bit-rate Speech Vocoder

Sinusoidal excitation LPC vocoder

HMM estimation of energy contours in speech decoders

A New Distortion Measure for Parameter Quantization Based on MELP

Research on MBE Algorithm at Bit Rate 800 Bps-2.4 Kbps Vocoder

A Fractional Bit Allocation Algorithm Based on Mixed Excitation Linear Prediction

Techniques Of Very Low Bit-Rate Speech Coding

Amplitude Spectrum Based Excitation Model for HMM-Based Speech Synthesis

Low-Complexity 3.6 Kb/s Speech Coding Algorithm Based on Sinusoidal Excitation

Excitation-based Voice Quality Analysis and Modification

High Efficiency MSVQ for Prediction Linear Spectrum Frequency Parameters with Inter-Frame and Inter-Stage Prediction

DNN-Based Spectral Enhancement for Neural Waveform Generators with Low-Bit Quantization

Decomposed Vector Combination-Based Low-Complexity Behavioral Model for Digital Predistortion of RF Transmitters

Voiced/Unvoiced Classification Recovery In The Speech Decoder Based On Gmm