Abstract: This paper introduces StyleSpeech, a novel Text-to-Speech (TTS) system that enhances the naturalness and accuracy of synthesized speech. Building upon existing TTS technologies, StyleSpeech incorporates a unique Style Decorator structure that enables deep learning models to simultaneously learn style and phoneme features, improving adaptability and efficiency through the principles of Low-Rank Adaptation (LoRA). LoRA allows efficient adaptation of style features in pre-trained models. Additionally, we introduce a novel automatic evaluation metric, the LLM-Guided Mean Opinion Score (LLM-MOS), which employs large language models to offer an objective and robust protocol for automatically assessing TTS system performance. Extensive testing on benchmark datasets shows that our approach markedly outperforms existing state-of-the-art baseline methods in producing natural, accurate, and high-quality speech. These advancements not only push the boundaries of current TTS system capabilities, but also facilitate the application of TTS systems in more dynamic and specialized settings, such as interactive virtual assistants, adaptive audiobooks, and customized voices for gaming. Speech samples can be found at https://style-speech.vercel.app

What problem does this paper attempt to address?

The paper attempts to address two main issues:

1. **Improving the naturalness and accuracy of speech synthesis**: Existing text-to-speech (TTS) systems often lack variation and style control, so the synthesized speech does not sound natural or engaging enough. The paper proposes a new TTS framework, StyleSpeech, which introduces a Style Decorator structure that lets the deep learning model learn style features and phoneme features simultaneously, thereby enhancing the system's adaptability and efficiency.
2. **Improving the evaluation methods of TTS systems**: Current TTS research lacks standardized evaluation protocols. Most studies rely on subjective Mean Opinion Scores (MOS), which are labor-intensive and susceptible to biases and variation in human perception. The paper introduces a new automatic evaluation metric, the LLM-Guided Mean Opinion Score (LLM-MOS), which uses large language models (LLMs) to provide a more objective and efficient evaluation method.

Specifically, the main contributions of the paper include:

- Proposing a novel Style Decorator structure that effectively separates the training of style features and phoneme features, simplifying the process of style adaptation.
- Using Low-Rank Adaptation (LoRA) to fine-tune pre-trained models effectively with minimal parameter adjustments, preserving the unique characteristics of phoneme embeddings.
- Introducing LLM-MOS, a new automatic evaluation metric that leverages large language models to provide an objective and robust assessment of TTS system performance.
- Conducting extensive experiments on well-known benchmark datasets, showing that StyleSpeech improves word error rate and overall score by 15% and 12%, respectively, compared to existing baseline models.

These improvements not only push the boundaries of current TTS system capabilities but also promote the use of TTS systems in dynamic and specialized applications, such as interactive virtual assistants, adaptive audiobooks, and customized game voices.
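
To make the "minimal parameter adjustments" point concrete, here is a minimal sketch of the generic LoRA idea: a frozen weight matrix `W0` is left untouched, and only a low-rank pair `B` and `A` is trained, giving an effective weight `W0 + (alpha / rank) * B @ A`. This is not the paper's Style Decorator code. The layer sizes, the rank, the scaling factor `alpha`, and the helper names `matmul`, `lora_effective_weight`, and `trainable_params` are all illustrative assumptions, and how StyleSpeech actually places the adapters is not described in this summary.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w0, a, b, alpha, rank):
    """Return W0 + (alpha / rank) * B @ A without modifying the frozen w0."""
    delta = matmul(b, a)  # (d_out x rank) @ (rank x d_in) -> d_out x d_in
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

def trainable_params(d_out, d_in, rank):
    """Number of trainable entries in the low-rank pair B and A."""
    return rank * (d_in + d_out)

# Hypothetical sizes chosen only for illustration.
random.seed(0)
d_out, d_in, rank, alpha = 8, 6, 2, 4.0

# Frozen pre-trained weight (stands in for, e.g., a phoneme-side projection).
w0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
# A gets a small random init; B starts at zero, so the adapter is a no-op at first.
a = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
b = [[0.0] * rank for _ in range(d_out)]

w_init = lora_effective_weight(w0, a, b, alpha, rank)
full_count = d_out * d_in
lora_count = trainable_params(d_out, d_in, rank)
print(f"full fine-tuning: {full_count} params, LoRA: {lora_count} params")
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the pre-trained one, which is one way to read the claim that the original phoneme embeddings are preserved while style is adapted. The parameter saving grows with layer size, since the trainable count scales with `rank * (d_in + d_out)` rather than `d_in * d_out`.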

StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

StyleTTS: A Style-Based Generative Model for Natural and Diverse Text-to-Speech Synthesis

Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation

Fine-grained style control in Transformer-based Text-to-speech Synthesis

Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations

StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis

Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec

ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations

AS-Speech: Adaptive Style for Speech Synthesis

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis

Adaptive Text to Speech for Spontaneous Style

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis

Self-supervised Context-aware Style Representation for Expressive Speech Synthesis

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models