StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech

Haowei Lou, Helen Paik, Wen Hu, Lina Yao
2024-08-27
Abstract: This paper introduces StyleSpeech, a novel Text-to-Speech (TTS) system that enhances the naturalness and accuracy of synthesized speech. Building upon existing TTS technologies, StyleSpeech incorporates a unique Style Decorator structure that enables deep learning models to learn style and phoneme features simultaneously, improving adaptability and efficiency through the principles of Low-Rank Adaptation (LoRA). LoRA allows efficient adaptation of style features in pre-trained models. Additionally, we introduce a novel automatic evaluation metric, the LLM-Guided Mean Opinion Score (LLM-MOS), which employs large language models to offer an objective and robust protocol for automatically assessing TTS system performance. Extensive testing on benchmark datasets shows that our approach markedly outperforms existing state-of-the-art baseline methods in producing natural, accurate, and high-quality speech. These advancements not only push the boundaries of current TTS system capabilities, but also facilitate the application of TTS systems in more dynamic and specialized settings, such as interactive virtual assistants, adaptive audiobooks, and customized voices for gaming. Speech samples can be found at https://style-speech.vercel.app
Sound, Artificial Intelligence, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
The paper attempts to address two main issues:

1. **Improving the naturalness and accuracy of speech synthesis**: Existing text-to-speech (TTS) systems often lack variation and style control in synthesized speech, resulting in speech that is not natural or engaging enough. The paper proposes a new TTS framework, StyleSpeech, which introduces a unique Style Decorator structure, enabling the deep learning model to learn both style features and phoneme features simultaneously, thereby enhancing the system's adaptability and efficiency.
2. **Improving the evaluation methods of TTS systems**: Current TTS research lacks standardized evaluation protocols, with most studies relying on subjective Mean Opinion Scores (MOS), which are labor-intensive and susceptible to human perception biases and variations. The paper introduces a new automatic evaluation metric, the LLM-Guided Mean Opinion Score (LLM-MOS), which uses large language models (LLMs) to provide a more objective and efficient evaluation method.

Specifically, the main contributions of the paper include:

- Proposing a novel Style Decorator structure that effectively separates the training of style features and phoneme features, simplifying the process of style adaptation.
- Using Low-Rank Adaptation (LoRA) technology to achieve effective fine-tuning of pre-trained models with minimal parameter adjustments, preserving the unique characteristics of phoneme embeddings.
- Introducing LLM-MOS, a new automatic evaluation metric that leverages large language models to provide an objective and robust assessment of TTS system performance.
- Conducting extensive experiments on well-known benchmark datasets, showing that StyleSpeech improves word error rate and overall score by 15% and 12%, respectively, compared to existing baseline models.

These improvements not only push the boundaries of current TTS system capabilities but also promote the application of TTS systems in dynamic and specialized applications, such as interactive virtual assistants, adaptive audiobooks, and customized game voices.
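
To make the LoRA contribution concrete, the following is a minimal sketch of the general low-rank adaptation idea: a frozen weight matrix `W` is left untouched, and only a small pair of matrices `B` and `A` is trained, so the effective weight is `W + (alpha / rank) * B @ A`. This is not the paper's Style Decorator or its training code. The class name `LoRALinear`, the layer size, the rank, and the `alpha` scaling are all illustrative assumptions, and plain Python lists are used only to keep the example dependency-free.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus a low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pre-trained weight, never modified here
        # A starts small and random, B starts at zero, so the update B @ A
        # is zero at first and the layer initially matches the frozen model.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def effective_weight(self):
        """Return W + scale * (B @ A) as a new matrix."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to a vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def trainable_parameters(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

    def frozen_parameters(self):
        return len(self.weight) * len(self.weight[0])


# Illustrative sizes only (assumed, not taken from the paper).
d_out, d_in, rank = 64, 64, 4
rng = random.Random(1)
W = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
x = [1.0] * d_in

layer = LoRALinear(W, rank=rank)
base_out = [sum(w * xi for w, xi in zip(row, x)) for row in W]
initial_out = layer.forward(x)  # identical to base_out while B is all zeros

# Stand-in for a training step: change one entry of B and observe that
# only the corresponding output row moves, while W itself is untouched.
layer.B[0][0] = 1.0
adapted_out = layer.forward(x)

print(layer.trainable_parameters(), "trainable vs", layer.frozen_parameters(), "frozen")
```

With these assumed sizes, the adapter trains 512 values against 4096 frozen ones, which illustrates why a low-rank update can adapt a pre-trained model with few parameter changes. How StyleSpeech applies this to style features specifically is described in the paper itself and is not reproduced here.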