Abstract: Text-to-Speech (TTS) systems face ongoing challenges in processing complex linguistic features, handling polyphonic expressions, and producing natural-sounding multilingual speech, capabilities that are crucial for future AI applications. In this paper, we present Fish-Speech, a novel framework that implements a serial fast-slow Dual Autoregressive (Dual-AR) architecture to enhance the stability of Grouped Finite Scalar Vector Quantization (GFSQ) in sequence generation tasks. This architecture improves codebook processing efficiency while maintaining high-fidelity outputs, making it particularly effective for AI interactions and voice cloning. Fish-Speech leverages Large Language Models (LLMs) for linguistic feature extraction, eliminating the need for traditional grapheme-to-phoneme (G2P) conversion and thereby streamlining the synthesis pipeline and enhancing multilingual support. Additionally, we developed FF-GAN through GFSQ to achieve superior compression ratios and near 100% codebook utilization. Our approach addresses key limitations of current TTS systems while providing a foundation for more sophisticated, context-aware speech synthesis. Experimental results show that Fish-Speech significantly outperforms baseline models in handling complex linguistic scenarios and voice cloning tasks, demonstrating its potential to advance TTS technology in AI applications. The implementation is open source at https://github.com/fishaudio/fish-speech.
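To make the quantization idea in the abstract more concrete, here is a minimal sketch of grouped finite scalar quantization in plain Python. It is an illustration under stated assumptions, not the paper's GFSQ implementation: the function names (`fsq_quantize`, `codes_to_index`, `grouped_fsq`), the level counts, and the group size are all hypothetical, and only odd level counts are handled. Each scalar is bounded with `tanh`, rounded to a fixed integer grid, and the per-dimension codes of a group are packed into one codebook index.

```python
import math


def fsq_quantize(values, levels):
    """Round each scalar to one of `levels[i]` integer steps (odd level counts only)."""
    codes = []
    for v, n_levels in zip(values, levels):
        half = (n_levels - 1) / 2        # e.g. 5 levels -> codes in {-2, ..., 2}
        bounded = math.tanh(v) * half    # squash the input into (-half, half)
        codes.append(int(round(bounded)))
    return codes


def codes_to_index(codes, levels):
    """Pack per-dimension codes into a single mixed-radix codebook index."""
    index = 0
    for c, n_levels in zip(codes, levels):
        half = (n_levels - 1) // 2
        index = index * n_levels + (c + half)   # shift code to 0..n_levels-1
    return index


def grouped_fsq(vector, n_groups, levels):
    """Split `vector` into equal groups and quantize each group independently."""
    size = len(vector) // n_groups
    if size * n_groups != len(vector) or size != len(levels):
        raise ValueError("vector length must equal n_groups * len(levels)")
    indices = []
    for g in range(n_groups):
        chunk = vector[g * size:(g + 1) * size]
        indices.append(codes_to_index(fsq_quantize(chunk, levels), levels))
    return indices


# Hypothetical setup: a 4-dimensional latent, 2 groups, 5 levels per dimension,
# so each group has an implicit codebook of 5 * 5 = 25 entries.
print(grouped_fsq([0.0, 0.0, 10.0, -10.0], n_groups=2, levels=[5, 5]))
```

In this sketch the codebook is implicit: every combination of grid values is a valid code, so there are no learned entries that can go unused. That property is often given as a reason scalar-grid quantizers reach high utilization, though the summary here does not say how the paper measures it.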

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the difficulties current Text-to-Speech (TTS) systems have with complex linguistic features, polyphonic character expressions, and natural multilingual speech generation, all of which matter for future AI applications. It proposes a new framework, Fish-Speech, that tackles these difficulties through the following innovations:

1. **Handling Complex Linguistic Features**: Current TTS systems struggle with complex linguistic features, especially polyphonic characters and cross-language generalization. Fish-Speech uses Large Language Models (LLMs) to extract linguistic features, which removes the need for traditional Grapheme-to-Phoneme (G2P) conversion, simplifies the synthesis pipeline, and improves multilingual support.
2. **Multilingual Speech Synthesis**: Existing TTS systems often require language-specific phoneme rules and dictionaries for multilingual tasks, which limits scalability and increases maintenance effort. By integrating LLMs, Fish-Speech handles multilingual text more effectively and improves the quality and consistency of multilingual speech synthesis.
3. **High-Quality Speech Synthesis**: Fish-Speech introduces a new vocoder architecture, FF-GAN, which combines several quantization techniques to optimize compression ratio and codebook utilization while producing high-fidelity speech.
4. **Real-Time Performance**: Fish-Speech generates and processes codebook tokens efficiently through a Dual-Autoregressive (Dual-AR) architecture and optimized acceleration methods. This enables real-time speech synthesis on both consumer-grade and high-performance GPUs and significantly reduces latency.

### Main Contributions

1. **Introduction of the Fish-Speech Framework**: The framework uses LLMs and a Dual-AR architecture in place of traditional G2P conversion, providing a robust and scalable solution for multilingual speech synthesis.
2. **Development of the FF-GAN Vocoder**: The vocoder integrates several quantization techniques to achieve high-fidelity speech synthesis while optimizing compression ratio and codebook utilization.
3. **Achieving Efficient Real-Time Performance**: With fish-tech acceleration methods, the system achieves a real-time factor of approximately 1:5 on a consumer-grade NVIDIA RTX 4060 mobile platform and approximately 1:15 on a high-performance NVIDIA RTX 4090 configuration, with a latency of only 150 milliseconds.

### Experimental Results

Experimental results show that Fish-Speech significantly outperforms baseline models in handling complex linguistic scenarios and voice cloning tasks, demonstrating its potential for TTS technology, particularly for advanced conversational agents in AI applications.
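The real-time factors quoted above can be turned into rough synthesis times with a few lines of arithmetic. The sketch below assumes that a ratio of 1:N means about one second of compute per N seconds of generated audio. That reading is an assumption, as is the helper name `synthesis_seconds`. The reported 150 ms latency is treated as a separate figure and is not modeled here.

```python
def synthesis_seconds(audio_seconds, rtf_ratio):
    """Estimate compute time for `audio_seconds` of speech at a 1:rtf_ratio real-time factor.

    Assumes 1:N means one second of compute yields N seconds of audio.
    """
    if rtf_ratio <= 0:
        raise ValueError("rtf_ratio must be positive")
    return audio_seconds / rtf_ratio


# Ratios taken from the summary; the 60-second clip length is an arbitrary example.
for device, ratio in [("RTX 4060 mobile", 5), ("RTX 4090", 15)]:
    print(f"{device}: ~{synthesis_seconds(60, ratio):.0f} s to synthesize 60 s of audio")
```

Under that assumption, a one-minute clip would take roughly 12 seconds on the RTX 4060 mobile platform and roughly 4 seconds on the RTX 4090.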

Fish-Speech: Leveraging Large Language Models for Advanced Multilingual Text-to-Speech Synthesis
