Abstract: In recent years, speech generation has seen remarkable progress, now achieving one-shot generation capability that is often virtually indistinguishable from a real human voice. Integrating such advancements in speech generation with large language models might revolutionize a wide range of applications. However, certain applications, such as assistive conversational systems, require natural and conversational speech generation tools that also operate efficiently in real time. Current state-of-the-art models like VALL-E and SoundStorm, powered by hierarchical neural audio codecs, require large neural components and extensive training data to work well. In contrast, MQTTS aims to build more compact conversational TTS models while capitalizing on smaller-scale real-life conversational speech data. However, its autoregressive nature yields high inference latency and thus limits its real-time usage. To mitigate the current limitations of state-of-the-art TTS models while capitalizing on their strengths, in this work we introduce the Pheme model series that 1) offers compact yet high-performing models, 2) allows for parallel speech generation 3) of natural conversational speech, and 4) can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x while still matching the quality of the autoregressive TTS models. We also show that through simple teacher-student distillation we can achieve significant improvements in voice quality for single-speaker setups on top of pretrained Pheme checkpoints, relying solely on synthetic speech generated by much larger teacher models. Audio samples and pretrained models are available online.

Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

MaskSR: Masked Language Model for Full-band Speech Restoration

Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility

Restorative Speech Enhancement: A Progressive Approach Using SE and Codec Modules

Diffusion-Based Mel-Spectrogram Enhancement for Personalized Speech Synthesis with Found Data

Utilizing Self-supervised Representations for MOS Prediction

SelfRemaster: Self-Supervised Speech Restoration with Analysis-by-Synthesis Approach Using Channel Modeling

Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction

An empirical study on speech restoration guided by self supervised speech representation

Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models

Rep2wav: Noise Robust text-to-speech Using self-supervised representations

Pheme: Efficient and Conversational Speech Generation

Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation

Progressive Multi-Target Network Based Speech Enhancement with SNR-Preselection for Robust Speaker Diarization

MHTTS: Fast multi-head text-to-speech for spontaneous speech with imperfect transcription

Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Bridging the Gap Between Monaural Speech Enhancement and Recognition With Distortion-Independent Acoustic Modeling

Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

Generative Speech Foundation Model Pretraining for High-Quality Speech Extraction and Restoration

Towards Robust FastSpeech 2 by Modelling Residual Multimodality