Abstract: In spoken dialogue, even if two current turns are the same sentence, their responses might still differ when they are spoken in different styles. The spoken styles, containing paralinguistic and prosodic information, mark the most significant difference between the text and speech modalities. When text-only LLMs are used to model spoken dialogue, they cannot give different responses based on the speaking style of the current turn. In this paper, we focus on enabling LLMs to listen to the speaking styles and respond properly. Our goal is to teach the LLM that "even if the sentences are identical, if they are spoken in different styles, their corresponding responses might be different". Since there is no suitable dataset for achieving this goal, we collect a speech-to-speech dataset, StyleTalk, with the following desired characteristic: when two current speeches have the same content but are spoken in different styles, their responses will be different. To teach LLMs to understand and respond properly to the speaking styles, we propose the Spoken-LLM framework, which can model the linguistic content and the speaking styles. We train Spoken-LLM using the StyleTalk dataset and devise a two-stage training pipeline to help the Spoken-LLM better learn the speaking styles. Based on extensive experiments, we show that Spoken-LLM outperforms text-only baselines and prior speech LLM methods.

What problem does this paper attempt to address?

This paper addresses a key limitation of current large language models (LLMs) in spoken dialogue: **even when two dialogue turns consist of the same sentence, the appropriate responses may differ if the turns are spoken in different styles**. Text-based LLMs cannot vary their responses by speaking style because they rely on text alone and ignore non-linguistic information in speech, such as intonation and emotion. Specifically, the authors make the following points:

1. **Differences between text and speech modalities**: Text-based LLMs generate responses from text content only, ignoring the paralinguistic and prosodic information (such as emotion, speech rate, and volume) carried by speech, which plays a crucial role in human conversation.
2. **Deficiencies in existing datasets**: No suitable dataset currently exists for training models to understand how speaking style affects the response. The authors therefore collect a new dataset, **StyleTalk**, which contains samples that share the same dialogue context and input sentence but differ in speaking style, along with the corresponding expressive speech responses.
3. **Multimodal fusion framework**: To enable LLMs to understand and respond to different speaking styles, the authors propose a framework named **Spoken-LLM**. It combines an open-source large language model (such as Llama 2-Chat) with a self-supervised speech emotion representation model (such as emotion2vec), and uses a two-stage training method to learn speaking styles and generate natural speech responses.
4. **Experimental verification**: Through objective and subjective evaluations, the authors show that Spoken-LLM outperforms existing text-based and speech-based baselines, generating responses that are more reasonable and better matched to the speaking style.

### Summary

The core problem of this paper is **how to make large language models generate appropriate responses according to different speaking styles in spoken dialogue**. To this end, the authors construct a new dataset and propose a new framework, which together improve model performance on this task.
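The defining property of StyleTalk (same dialogue context and input sentence, different speaking style, different response) can be illustrated with a small consistency check. This is only a minimal sketch: the `Sample` fields, the style labels, the example sentences, and the `violations` helper are all hypothetical and are not taken from the paper or the released dataset.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    # All field names are assumptions made for illustration.
    context: str         # preceding dialogue history
    current_text: str    # transcript of the current turn
    current_style: str   # e.g. emotion / speed / volume label
    response_text: str   # transcript of the response


def violations(samples):
    """Return (context, current_text) keys for which two turns with
    different speaking styles share exactly the same response text."""
    groups = defaultdict(list)
    for s in samples:
        groups[(s.context, s.current_text)].append(s)

    bad = []
    for key, group in groups.items():
        style_by_response = {}
        for s in group:
            prev_style = style_by_response.setdefault(s.response_text, s.current_style)
            if prev_style != s.current_style:
                bad.append(key)
                break
    return bad


# Invented examples: identical words, different styles, different responses.
SAMPLES = [
    Sample("A: How was the interview?", "It's over.", "cheerful",
           "That sounds like it went well!"),
    Sample("A: How was the interview?", "It's over.", "sad",
           "I'm sorry, do you want to talk about it?"),
]

print(violations(SAMPLES))
```

A text-only model sees both samples as the same input, which is why it cannot reproduce the two different responses; a model that also receives the style information can tell them apart.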

Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations

Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

StyleChat: Learning Recitation-Augmented Memory in LLMs for Stylized Dialogue Generation

Controllable Speaking Styles Using a Large Language Model

Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

Spoken Language Intelligence of Large Language Models for Language Learning

A Survey on Speech Large Language Models

Meta-Tuning LLMs to Leverage Lexical Knowledge for Generalizable Language Style Understanding

Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation

CAT-LLM: Prompting Large Language Models with Text Style Definition for Chinese Article-style Transfer

Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis.

Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech

Large Language Models Know What To Say But Not When To Speak

Recent Advances in Speech Language Models: A Survey

Large Language Model based Situational Dialogues for Second Language Learning

InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

LMStyle Benchmark: Evaluating Text Style Transfer for Chatbots