Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model

Xinyu Zhou, Delong Chen, Yudong Chen
2023-09-20
Abstract: This paper explores the potential of constructing an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" simultaneously, which more closely aligns with the human speech production process compared to the current cascade pipeline of independent chatbot and Text-to-Speech (TTS) modules. We hypothesize that Large Language Models (LLMs) with billions of parameters possess significant speech understanding capabilities and can jointly model dialogue responses and linguistic features. We conduct two sets of experiments: 1) Prosodic structure prediction, a typical front-end task in TTS, demonstrating the speech understanding ability of LLMs, and 2) Further integrating dialogue response and a wide array of linguistic features using a unified encoding format. Our results indicate that the LLM-based approach is a promising direction for building unified spoken dialogue systems.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### The Problem This Paper Attempts to Solve

This paper aims to build an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" at the same time, bringing it closer to the human speech production process. Current spoken dialogue systems typically use a cascaded pipeline: an independent chatbot module followed by an independent text-to-speech (TTS) module. This design has two main issues:

1. **Limitations in expressiveness and interactivity**:
   - Current TTS modules are usually based on small language models (e.g., the BERT model with 10 million parameters), which have limited ability to understand complex dialogue contexts.
   - The dialogue response module (an LLM chatbot) and the TTS module work independently, so speech synthesis cannot use the dialogue context, even though that context is crucial for producing an appropriate spoken response.
2. **Differences from the human speech production process**:
   - Human speech production is parallel and incremental, with stages such as conceptualization, formulation, and articulation. The two-stage pipeline differs fundamentally from this process.

To overcome these issues, the authors propose a unified framework based on large language models (LLMs) that handles dialogue responses and linguistic features together. The paper tests this hypothesis with two experiments:

1. **Prosodic Structure Prediction**:
   - Prosodic structure prediction, a typical TTS front-end task, is used to demonstrate the LLM's speech understanding ability. According to the reported results, prompt-based ChatGPT and fine-tuned ChatGLM models perform strongly on this task, even surpassing traditional methods.
2. **Joint Prediction of Dialogue Responses and Linguistic Features**:
   - The model jointly predicts the dialogue response and several linguistic features (characters, duration, pinyin, prosodic hierarchy, highest pitch, and lowest pitch) to test whether one LLM can handle both tasks. The reported results show that a fully fine-tuned ChatGLM2-6B model can generate dialogue responses together with the corresponding linguistic features, although some overfitting is evident on the test set.

### Summary

The main goal of this paper is to explore an AI spoken dialogue system that generates dialogue responses and speech features at the same time, bringing it closer to the human speech production process. Through experiments on prosodic structure prediction and on joint prediction of dialogue responses and linguistic features, the authors show the potential of a unified LLM-based framework for this goal. The work also has limitations, including high training costs, insufficient dataset size, and limited expressiveness.
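
### Illustrative Sketch: A Unified Encoding

The joint-prediction experiment relies on a unified encoding format that puts the response text and its per-character linguistic features into one sequence an LLM can generate. The exact format used in the paper is not given here, so the following Python sketch is only an assumption of what such an encoding might look like. The separators, the field order, the `CharFeatures` class, the function names, and all numeric values are hypothetical; only the list of features (character, pinyin, duration, prosodic hierarchy, highest pitch, lowest pitch) comes from the summary above.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical separators; the paper's actual encoding may differ.
FIELD_SEP = "|"   # separates features within one character
TOKEN_SEP = " "   # separates characters within one response


@dataclass
class CharFeatures:
    """Linguistic features attached to a single character (illustrative)."""
    char: str            # the character itself
    pinyin: str          # pinyin with tone number, e.g. "ni3"
    duration_ms: int     # duration in milliseconds (unit is an assumption)
    prosody_level: int   # prosodic hierarchy level after this character
    pitch_max_hz: int    # highest pitch (unit is an assumption)
    pitch_min_hz: int    # lowest pitch (unit is an assumption)


def encode(tokens: List[CharFeatures]) -> str:
    """Serialize characters and features into one flat text sequence."""
    return TOKEN_SEP.join(
        FIELD_SEP.join([
            t.char, t.pinyin, str(t.duration_ms), str(t.prosody_level),
            str(t.pitch_max_hz), str(t.pitch_min_hz),
        ])
        for t in tokens
    )


def decode(text: str) -> List[CharFeatures]:
    """Parse a generated sequence back into structured features."""
    tokens = []
    for chunk in text.split(TOKEN_SEP):
        parts = chunk.split(FIELD_SEP)
        if len(parts) != 6:
            raise ValueError(f"malformed token: {chunk!r}")
        char, pinyin, dur, level, hi, lo = parts
        tokens.append(
            CharFeatures(char, pinyin, int(dur), int(level), int(hi), int(lo))
        )
    return tokens


def response_text(tokens: List[CharFeatures]) -> str:
    """Recover the plain dialogue response from the decoded tokens."""
    return "".join(t.char for t in tokens)


# Made-up feature values for a two-character response.
example = [
    CharFeatures("你", "ni3", 180, 0, 220, 190),
    CharFeatures("好", "hao3", 240, 3, 210, 150),
]

encoded = encode(example)
print(encoded)
print(response_text(decode(encoded)))
```

Under this assumed design, a single generated string carries both what to say and how to say it: the plain response is recovered by keeping only the character field, while the remaining fields could be passed to a downstream acoustic model. A strict `decode` that rejects malformed tokens is useful here because LLM output is not guaranteed to follow the format.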