Abstract: The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and contextually relevant dialogues. However, extending these advancements to enable end-to-end speech-to-speech conversation bots remains a formidable challenge, primarily due to the extensive dataset and computational resources required. The conventional approach of cascading automatic speech recognition (ASR), LLM, and text-to-speech (TTS) models in a pipeline, while effective, suffers from unnatural prosody because it lacks direct interactions between the input audio and its transcribed text and the output audio. These systems are also limited by their inherent latency from the ASR process for real-time applications. This paper introduces Style-Talker, an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialog generation. Style-Talker takes user input audio and uses transcribed chat history and speech styles to generate both the speaking style and text for the response. Subsequently, the TTS model synthesizes the speech, which is then played back to the user. While the response speech is being played, the input speech undergoes ASR processing to extract the transcription and speaking style, serving as the context for the ensuing dialogue turn. This novel pipeline accelerates the traditional cascade ASR-LLM-TTS systems while integrating rich paralinguistic information from input speech. Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence while being more than 50% faster.
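
The turn structure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `audio_llm`, `tts`, `play`, and `asr_and_style` are hypothetical stubs, the style is a plain string standing in for whatever style representation the real models use, and a thread pool is just one assumed way to run ASR while the response plays.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Turn:
    speaker: str
    text: str
    style: str  # stand-in for a speaking-style representation


@dataclass
class Dialogue:
    history: list = field(default_factory=list)


def audio_llm(user_audio, history):
    """Hypothetical audio LLM: raw audio + text/style history -> (text, style)."""
    return f"reply to turn {len(history) // 2 + 1}", "calm"


def tts(text, style):
    """Hypothetical style-based TTS: returns fake waveform samples."""
    return [0.0] * len(text)


def play(waveform):
    """Hypothetical playback; a real system would block for the audio duration."""
    return len(waveform)


def asr_and_style(user_audio):
    """Hypothetical ASR + style extraction for the user's utterance."""
    return f"transcript of {user_audio}", "excited"


def run_turn(dialogue, user_audio, pool):
    # 1. Respond directly from the input audio; ASR is not on the critical path.
    text, style = audio_llm(user_audio, dialogue.history)
    waveform = tts(text, style)
    # 2. Start ASR in the background, then play the response.
    pending = pool.submit(asr_and_style, user_audio)
    play(waveform)
    # 3. Store transcript and style as context for the next turn.
    user_text, user_style = pending.result()
    dialogue.history.append(Turn("user", user_text, user_style))
    dialogue.history.append(Turn("system", text, style))
    return text


dialogue = Dialogue()
with ThreadPoolExecutor(max_workers=1) as pool:
    for clip in ["clip_1.wav", "clip_2.wav"]:
        run_turn(dialogue, clip, pool)
```

In this sketch the first turn sees an empty history, and every later turn sees the transcripts and styles collected while earlier responses were playing.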

What problem does this paper attempt to address?

The paper addresses several challenges that current spoken dialogue systems face in generating natural, fluent conversation:

1. **Limitations of Traditional Cascade Systems (ASR-LLM-TTS)**:
   - **Lack of Emotion and Style**: Cascade systems chain automatic speech recognition (ASR), a large language model (LLM), and a text-to-speech (TTS) model. The pipeline is effective, but it lacks direct interaction between the input audio, its transcribed text, and the output audio, so the generated speech has unnatural prosody, emotion, and style.
   - **Latency Issues**: The autoregressive decoding in the ASR step adds latency that limits the whole system in real-time applications.
2. **Challenges of End-to-End (E2E) Spoken Dialogue Systems**:
   - **Data Requirements and Computational Resources**: End-to-end models can in principle generate speech more directly, but in practice they need extensive datasets and a large amount of computational resources, and their processing speed makes them less feasible for real-time applications.
   - **Semantic Coherence**: End-to-end dialogue systems trained from scratch often produce responses that are not semantically coherent enough.

To solve these problems, the paper proposes the **Style-Talker** framework, which fine-tunes an audio LLM alongside a style-based TTS model and combines them as follows:

- **Directly Process Audio Input**: Style-Talker processes the user's input audio directly and uses the transcribed chat history and speaking styles to generate the response text and its speaking style.
- **Reduce ASR Dependence**: While the response speech is being played, the system runs ASR on the current input audio to extract its transcription and speaking style, which serve as context for the next dialogue turn. The response to the current turn therefore does not wait for ASR, which improves real-time performance.
- **Emotional and Stylistic Consistency**: Because the audio LLM is trained to generate both the response text and its associated speaking style, Style-Talker can synthesize speech that reflects the intended emotion and style, making the dialogue more natural and coherent.

Experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in dialogue naturalness and coherence while being more than 50% faster.
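
Why moving ASR off the critical path helps can be shown with a simple latency model. The sketch below is illustrative only: the function names are hypothetical, the timings are invented and are not the paper's measurements, and it assumes the audio LLM costs about as much as a text LLM.

```python
def cascade_latency(t_asr, t_llm, t_tts):
    """ASR, LLM, and TTS run back to back before any audio is heard."""
    return t_asr + t_llm + t_tts


def overlapped_latency(t_audio_llm, t_tts):
    """ASR runs during playback, so only the audio LLM and TTS delay the response."""
    return t_audio_llm + t_tts


def asr_keeps_up(t_asr, playback_duration):
    """ASR stays hidden only if it finishes before the response stops playing."""
    return t_asr <= playback_duration


# Invented timings in seconds, for illustration only.
T_ASR, T_LLM, T_TTS, PLAYBACK = 0.5, 0.5, 0.25, 3.0

cascade = cascade_latency(T_ASR, T_LLM, T_TTS)
overlapped = overlapped_latency(T_LLM, T_TTS)
saved = cascade - overlapped
hidden = asr_keeps_up(T_ASR, PLAYBACK)
```

Under this model the saving equals the ASR time, provided ASR finishes within the playback window; if it does not, the next turn has to wait for the remainder.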

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
