Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

Yinghao Aaron Li, Xilin Jiang, Jordan Darefsky, Ge Zhu, Nima Mesgarani
2024-08-13
Abstract: The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and contextually relevant dialogues. However, extending these advancements to enable end-to-end speech-to-speech conversation bots remains a formidable challenge, primarily due to the extensive dataset and computational resources required. The conventional approach of cascading automatic speech recognition (ASR), LLM, and text-to-speech (TTS) models in a pipeline, while effective, suffers from unnatural prosody because it lacks direct interactions between the input audio and its transcribed text and the output audio. These systems are also limited by their inherent latency from the ASR process for real-time applications. This paper introduces Style-Talker, an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialog generation. Style-Talker takes user input audio and uses transcribed chat history and speech styles to generate both the speaking style and text for the response. Subsequently, the TTS model synthesizes the speech, which is then played back to the user. While the response speech is being played, the input speech undergoes ASR processing to extract the transcription and speaking style, serving as the context for the ensuing dialogue turn. This novel pipeline accelerates the traditional cascade ASR-LLM-TTS systems while integrating rich paralinguistic information from input speech. Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence while being more than 50% faster.
Computation and Language, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses several key challenges that current spoken dialogue systems face in generating natural, fluent speech:

1. **Limitations of Traditional Cascade Systems (ASR-LLM-TTS)**:
   - **Lack of Emotion and Style**: Cascade systems chain automatic speech recognition (ASR), a large language model (LLM), and a text-to-speech (TTS) model in a pipeline. They are effective, but because the input audio, its transcript, and the output audio never interact directly, the generated speech has unnatural prosody, emotion, and style.
   - **Latency Issues**: The autoregressive decoding in the ASR step adds delay before every response, which limits the whole system in real-time applications.
2. **Challenges of End-to-End (E2E) Spoken Dialogue Systems**:
   - **Data Requirements and Computational Resources**: End-to-end models can in theory generate speech more directly, but in practice they need large datasets and substantial compute and are slow to run, which makes them less feasible for real-time use.
   - **Semantic Coherence**: End-to-end dialogue systems trained from scratch often produce responses that are not semantically coherent.

To solve these problems, the paper proposes the **Style-Talker** framework, which fine-tunes an audio LLM alongside a style-based TTS model and combines them in the following ways:

- **Directly Process Audio Input**: Style-Talker takes the user's input audio directly and uses the transcribed chat history and speaking styles of earlier turns to generate the response text and its speaking style.
- **Reduce ASR Dependence**: ASR on the current input audio runs while the response speech is being played back. The resulting transcription and speaking style serve as context for the next dialogue turn, so ASR no longer delays the current response and real-time performance improves.
- **Emotional and Stylistic Consistency**: Because the audio LLM is trained to generate response text together with its associated speaking style, Style-Talker can synthesize speech that reflects the intended emotion and style, making the dialogue more natural and coherent.

Experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in both dialogue naturalness and coherence, while being more than 50% faster.
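
The ordering of steps within one dialogue turn can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: `audio_llm`, `tts`, `asr_and_style`, `play`, and `dialogue_turn` are hypothetical stubs, audio is stood in for by a string, and the speaking style is a plain string placeholder. The only point it demonstrates is that the response is generated from the raw audio first, and that transcription of the same audio overlaps with playback.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    text: str   # transcript or generated response text
    style: str  # placeholder for a speaking-style representation


# Hypothetical stubs: each one stands in for a real model.
def audio_llm(user_audio: str, history: List[Turn]) -> Turn:
    """Produce response text and speaking style from raw audio plus prior context."""
    turn_number = len(history) // 2 + 1
    return Turn(text=f"reply to turn {turn_number}", style="calm")


def tts(response: Turn) -> str:
    """Synthesize speech conditioned on the response text and its style."""
    return f"<speech style={response.style}>{response.text}</speech>"


def asr_and_style(user_audio: str) -> Turn:
    """Transcribe the user's audio and extract its speaking style."""
    return Turn(text=f"transcript of {user_audio}", style="neutral")


def play(speech: str) -> None:
    """Placeholder for audio playback."""


def dialogue_turn(user_audio: str, history: List[Turn], pool: ThreadPoolExecutor) -> str:
    # 1. Respond from the raw audio; no ASR result for this turn is needed yet.
    response = audio_llm(user_audio, history)
    speech = tts(response)

    # 2. Start ASR and style extraction in the background, then play the reply.
    pending = pool.submit(asr_and_style, user_audio)
    play(speech)

    # 3. Store both turns as context for the next turn.
    history.append(pending.result())
    history.append(response)
    return speech


history: List[Turn] = []
with ThreadPoolExecutor(max_workers=1) as pool:
    first = dialogue_turn("audio_1", history, pool)
    second = dialogue_turn("audio_2", history, pool)
print(first)
print(len(history))
```

In a conventional cascade, the call to `asr_and_style` would have to finish before `audio_llm` could start. Here its result is only needed by the following turn, which is why it can run during playback.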