Abstract: We present MELLE, a novel language modeling approach for text-to-speech synthesis (TTS) based on continuous-valued tokens. MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition, bypassing the need for vector quantization, which was originally designed for audio compression and sacrifices fidelity compared to mel-spectrograms. Specifically, (i) instead of cross-entropy loss, we apply regression loss with a proposed spectrogram flux loss function to model the probability distribution of the continuous-valued tokens; (ii) we incorporate variational inference into MELLE to facilitate sampling mechanisms, thereby enhancing output diversity and model robustness. Experiments demonstrate that, compared to the two-stage codec language models VALL-E and its variants, the single-stage MELLE mitigates robustness issues by avoiding the inherent flaws of sampling discrete codes, achieves superior performance across multiple metrics, and, most importantly, offers a more streamlined paradigm. See https://aka.ms/melle for demos of our work.
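
The abstract names three ingredients without giving formulas: a regression loss on mel-spectrogram frames, a spectrogram flux loss, and sampling through variational inference. The sketch below illustrates one plausible reading of each in plain Python. Everything in it is an assumption made for illustration: the L1 + L2 form of the regression loss, the flux term as a negative L1 distance between a predicted frame and the previous target frame, the Gaussian reparameterization, and the names `regression_loss`, `flux_loss`, and `sample_latent`. It is not the authors' implementation.

```python
import math
import random


def regression_loss(pred, target):
    """Mean of (L1 + L2) error over all frames and mel bins (assumed form)."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            diff = p - t
            total += abs(diff) + diff * diff
            count += 1
    return total / count


def flux_loss(pred, target):
    """Negative mean L1 distance between predicted frame t and target frame t-1.

    Assumed form: minimizing this term rewards frame-to-frame variation,
    discouraging the model from repeating the previous frame.
    """
    total, count = 0.0, 0
    for t in range(1, len(pred)):
        for p, prev in zip(pred[t], target[t - 1]):
            total += abs(p - prev)
            count += 1
    return -total / count


def sample_latent(mu, log_var, rng):
    """Reparameterized Gaussian sample per dimension: z = mu + sigma * eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


# Toy example: two frames with two mel bins each.
pred = [[0.0, 0.0], [1.0, 1.0]]
target = [[0.0, 0.0], [1.0, 2.0]]
flux_weight = 0.5  # hypothetical weighting, not taken from the abstract
total_loss = regression_loss(pred, target) + flux_weight * flux_loss(pred, target)
z = sample_latent([0.0, 1.0], [0.0, 0.0], random.Random(0))
```

Sampling `z` from a predicted mean and log-variance, rather than decoding a single deterministic value, is what would give a continuous-valued model a source of output diversity comparable to sampling discrete codes.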

Study and development of MELP vocoder

Study and Development of Time Varying Bit-rate MELP Vocoder

An Improved 2.4kb/s Mixed Excitation Linear Prediction Vocoder

Realtime robust speech communication based on iterative joint source-channel decoding and demodulation algorithm for MELP vocoder

New Mixed Excitation Linear Prediction Codec at 2.4 kb/s

Design and Description of a 600 bps Speech Coder Based on Melpe

An Improved MELP Speech Coder

A Hybrid Structure Speech Coding Scheme Based on MELPe and LPCNet

Improvement of Mixed Excitation Linear Prediction Speech Coding Algorithm and Implementation on DSP System

Optimization for Algorithm of MELP Speech Codec Based on DSP Platform

A DSP-based Low Rate Speech Coding Technology in Short-Wave Communications

Performance Comparison of Linear Prediction based Vocoders in Linux Platform

Sinusoidal excitation LPC vocoder

Research on MBE Algorithm at Bit Rate 800 Bps-2.4 Kbps Vocoder

Design and Implementation of a Low Bit Rate Programmable Vocoder

Improvement of Voiced-Unvoiced Classification in Vocoders

Autoregressive Speech Synthesis without Vector Quantization

The Influence of Clipping on the Performance of a Low Bit Rate Parametric Speech Coder

Research on real time low bit-rate speech coding

Research on Low Delay Low Bit Rate Speech Coding Algorithm

High Quality Harmonic Excitation Linear Predictive Speech Coding at 2 Kb/s