Abstract: This paper explores the potential of constructing an AI spoken dialogue system that "thinks how to respond" and "thinks how to speak" simultaneously, which more closely aligns with the human speech production process than the current cascade pipeline of independent chatbot and Text-to-Speech (TTS) modules. We hypothesize that Large Language Models (LLMs) with billions of parameters possess significant speech understanding capabilities and can jointly model dialogue responses and linguistic features. We conduct two sets of experiments: 1) prosodic structure prediction, a typical front-end task in TTS, demonstrating the speech understanding ability of LLMs, and 2) further integration of dialogue response and a wide array of linguistic features using a unified encoding format. Our results indicate that the LLM-based approach is a promising direction for building unified spoken dialogue systems.

What problem does this paper attempt to address?

### The Problem This Paper Attempts to Solve

This paper aims to construct an AI voice dialogue system capable of simultaneously "thinking about how to respond" and "thinking about how to speak," thereby approximating the human speech generation process. Current voice dialogue systems typically adopt a staged pipeline in which a chatbot module and a text-to-speech (TTS) module run independently. This design has the following issues:

1. **Limitations in expressiveness and interactivity**:
   - Current TTS modules are usually based on small language models (e.g., the BERT model with 10 million parameters), which have limited ability to understand complex dialogue contexts.
   - The dialogue response generation module (i.e., the LLM-based chatbot) and the TTS module work independently, so speech synthesis cannot use the dialogue context, which is crucial for generating reasonable and appropriate voice responses.
2. **Differences from the human speech generation process**:
   - Human speech generation is a parallel and incremental process with multiple stages, such as conceptualization, formulation, and pronunciation. The existing two-stage pipeline fundamentally differs from this process.

To overcome these issues, the authors propose a unified framework based on large language models (LLMs) that handles dialogue responses and linguistic features simultaneously, thereby more closely approximating the human speech generation process. Specifically, the paper tests this hypothesis through the following two experiments:

1. **Prosodic Structure Prediction**:
   - Using the prosodic structure prediction task (a typical task in the TTS front end), the paper demonstrates the LLM's capability in speech understanding. Experimental results show that prompt-based ChatGPT and fine-tuned ChatGLM models perform well on this task, even surpassing traditional methods.
2. **Joint Prediction of Dialogue Responses and Linguistic Features**:
   - By jointly predicting dialogue responses and various linguistic features (such as characters, duration, pinyin, prosodic hierarchy, highest pitch, and lowest pitch), the paper further validates the LLM's capability in handling multiple tasks. Experimental results show that the fully fine-tuned ChatGLM2-6B model can successfully generate dialogue responses and the corresponding linguistic features, although some overfitting is observed on the test set.

### Summary

The main goal of this paper is to explore the construction of an AI voice dialogue system capable of simultaneously generating dialogue responses and speech features, thereby more closely approximating the human speech generation process. Through experiments on prosodic structure prediction and on joint prediction of dialogue responses and linguistic features, the authors validate the potential of a unified framework based on large language models for achieving this goal. However, the research also has some limitations, such as high training costs, insufficient dataset size, and limited expressiveness.
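
To make the idea of a unified encoding more concrete, the sketch below shows one way the six linguistic features named above (character, duration, pinyin, prosodic hierarchy, highest pitch, lowest pitch) could be serialized into a single text sequence that an LLM might be trained to emit, and parsed back afterwards. This is only an illustration: the summary does not specify the paper's actual format, so the `Token` class, the `|` and space separators, the units, and all numeric values here are assumptions.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical separators; the paper's real encoding format is not given here.
FIELD_SEP = "|"
TOKEN_SEP = " "


@dataclass
class Token:
    """One character of the response with its linguistic features (illustrative)."""
    char: str          # the character itself
    duration_ms: int   # duration, assumed to be in milliseconds
    pinyin: str        # pinyin with tone number, e.g. "ni3"
    prosody: int       # prosodic boundary level after this character (0 = none)
    pitch_high: int    # highest pitch, assumed to be in Hz
    pitch_low: int     # lowest pitch, assumed to be in Hz


def encode(tokens: List[Token]) -> str:
    """Serialize tokens into one flat string a language model could generate."""
    return TOKEN_SEP.join(
        FIELD_SEP.join(
            [t.char, str(t.duration_ms), t.pinyin,
             str(t.prosody), str(t.pitch_high), str(t.pitch_low)]
        )
        for t in tokens
    )


def decode(text: str) -> List[Token]:
    """Parse the flat string back into structured tokens."""
    tokens = []
    for chunk in text.split(TOKEN_SEP):
        char, dur, pinyin, prosody, high, low = chunk.split(FIELD_SEP)
        tokens.append(Token(char, int(dur), pinyin, int(prosody), int(high), int(low)))
    return tokens


def response_text(tokens: List[Token]) -> str:
    """Recover the plain dialogue response by dropping the speech features."""
    return "".join(t.char for t in tokens)


# Made-up feature values for the two-character response "你好".
example = [
    Token("你", 180, "ni3", 0, 220, 180),
    Token("好", 240, "hao3", 3, 200, 150),
]
encoded = encode(example)
print(encoded)
print(response_text(decode(encoded)))
```

Keeping the response text and the speech features in one sequence is what would let a single model produce both at once; a downstream synthesizer would consume the decoded features, while the plain response can still be recovered by dropping them.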

Towards Joint Modeling of Dialogue Response and Speech Synthesis based on Large Language Model

A Full-duplex Speech Dialogue Scheme Based On Large Language Models

Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation

A Survey on Speech Large Language Models

Large Language Model based Situational Dialogues for Second Language Learning

Harnessing the Power of Large Language Models for Empathetic Response Generation: Empirical Investigations and Improvements

Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

Joint Modelling of Spoken Language Understanding Tasks with Integrated Dialog History

Leveraging LLMs for Dialogue Quality Measurement

Language Model Can Listen While Speaking

Spoken Language Intelligence of Large Language Models for Language Learning

Toward Joint Language Modeling for Speech Units and Text

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

A Survey of Large Language Models

Think Before You Speak: Cultivating Communication Skills of Large Language Models via Inner Monologue

E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models

Roadmap towards Superhuman Speech Understanding using Large Language Models

Sibyl: Empowering Empathetic Dialogue Generation in Large Language Models via Sensible and Visionary Commonsense Inference

Pronunciation Assessment with Multi-modal Large Language Models

Do Large Language Model Understand Multi-Intent Spoken Language ?