Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis

Shijia Liao, Yuxuan Wang, Tianyu Li, Yifan Cheng, Ruoyi Zhang, Rongzhi Zhou, Yijin Xing
2024-11-02
Abstract: Text-to-Speech (TTS) systems face ongoing challenges in processing complex linguistic features, handling polyphonic expressions, and producing natural-sounding multilingual speech, capabilities that are crucial for future AI applications. In this paper, we present Fish-Speech, a novel framework that implements a serial fast-slow Dual Autoregressive (Dual-AR) architecture to enhance the stability of Grouped Finite Scalar Vector Quantization (GFSQ) in sequence generation tasks. This architecture improves codebook processing efficiency while maintaining high-fidelity outputs, making it particularly effective for AI interactions and voice cloning. Fish-Speech leverages Large Language Models (LLMs) for linguistic feature extraction, eliminating the need for traditional grapheme-to-phoneme (G2P) conversion and thereby streamlining the synthesis pipeline and enhancing multilingual support. Additionally, we developed FF-GAN through GFSQ to achieve superior compression ratios and near 100% codebook utilization. Our approach addresses key limitations of current TTS systems while providing a foundation for more sophisticated, context-aware speech synthesis. Experimental results show that Fish-Speech significantly outperforms baseline models in handling complex linguistic scenarios and voice cloning tasks, demonstrating its potential to advance TTS technology in AI applications. The implementation is open source at https://github.com/fishaudio/fish-speech.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address the challenges faced by current Text-to-Speech (TTS) systems in handling complex linguistic features, polyphonic character expressions, and generating natural multilingual speech. These issues are crucial for future AI applications. Specifically, the paper proposes a new framework called Fish-Speech, which addresses these challenges through the following innovations:

1. **Handling Complex Linguistic Features**: Current TTS systems struggle with complex linguistic features, especially polyphonic characters and cross-language generalization. Fish-Speech leverages Large Language Models (LLMs) to extract linguistic features, eliminating the need for traditional Grapheme-to-Phoneme (G2P) conversion, thereby simplifying the synthesis process and enhancing multilingual support.
2. **Multilingual Speech Synthesis**: Existing TTS systems often require language-specific phoneme rules and dictionaries for multilingual tasks, limiting system scalability and increasing maintenance complexity. By integrating LLMs, Fish-Speech can better handle multilingual text, improving the quality and consistency of multilingual speech synthesis.
3. **High-Quality Speech Synthesis**: To improve the quality of speech synthesis, Fish-Speech introduces a new vocoder architecture called FF-GAN, which combines various quantization techniques to optimize compression ratio and codebook utilization, achieving high-fidelity speech synthesis.
4. **Real-Time Performance**: Fish-Speech achieves efficient code generation and processing through a Dual-Autoregressive (Dual-AR) architecture and optimized acceleration methods, enabling real-time speech synthesis on consumer-grade and high-performance GPUs, significantly reducing latency.

### Main Contributions

1. **Introduction of the Fish-Speech Framework**: This framework utilizes LLMs and a Dual-AR architecture to replace traditional G2P conversion, providing a robust and scalable solution for multilingual speech synthesis.
2. **Development of the FF-GAN Vocoder**: This vocoder integrates various quantization techniques to achieve high-fidelity speech synthesis and optimizes compression ratio and codebook utilization.
3. **Achieving Efficient Real-Time Performance**: Through fish-tech acceleration methods, the system achieves a real-time factor of approximately 1:5 on a consumer-grade NVIDIA RTX 4060 mobile platform and a real-time factor of approximately 1:15 on a high-performance NVIDIA RTX 4090 configuration, with a latency of only 150 milliseconds.

### Experimental Results

Experimental results show that Fish-Speech significantly outperforms baseline models in handling complex linguistic scenarios and voice cloning tasks, demonstrating its potential in TTS technology, particularly in advanced conversational agent tasks in AI applications.
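
The real-time factors quoted under Main Contributions can be turned into rough synthesis-time estimates. The sketch below is illustrative arithmetic only, not code from the Fish-Speech repository. It assumes that "1:5" means one second of compute per five seconds of generated audio. The function name `synthesis_seconds` and the 60-second clip length are hypothetical choices made for the example.

```python
def synthesis_seconds(audio_seconds: float, compute_part: float, audio_part: float) -> float:
    """Estimate compute time for a clip, given a real-time factor written as
    compute_part:audio_part (assumed reading: compute seconds per audio seconds)."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    if compute_part <= 0 or audio_part <= 0:
        raise ValueError("both parts of the ratio must be positive")
    return audio_seconds * compute_part / audio_part


# Hypothetical 60-second clip, using the ratios quoted in the summary.
clip = 60.0
rtx_4060_mobile = synthesis_seconds(clip, 1, 5)    # ratio of about 1:5
rtx_4090 = synthesis_seconds(clip, 1, 15)          # ratio of about 1:15

print(f"RTX 4060 mobile: ~{rtx_4060_mobile:.1f} s of compute for {clip:.0f} s of audio")
print(f"RTX 4090:        ~{rtx_4090:.1f} s of compute for {clip:.0f} s of audio")
```

Under that reading, a one-minute clip would take roughly 12 seconds on the RTX 4060 mobile configuration and roughly 4 seconds on the RTX 4090. Both figures are approximate because the quoted ratios are themselves approximate.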