Abstract: Short feedback responses, such as backchannels, play an important role in spoken dialogue. So far, most of the modeling of feedback responses has focused on their timing, often neglecting how their lexical and prosodic form influence their contextual appropriateness and conversational function. In this paper, we investigate the possibility of embedding short dialogue contexts and feedback responses in the same representation space using a contrastive learning objective. In our evaluation, we primarily focus on how such embeddings can be used as a context-feedback appropriateness metric and thus for feedback response ranking in U.S. English dialogues. Our results show that the model outperforms humans given the same ranking task and that the learned embeddings carry information about the conversational function of feedback responses.

What problem does this paper attempt to address?

This paper addresses the modeling of feedback responses in Spoken Dialogue Systems (SDS), in particular how to take the dialogue context into account and convey an appropriate communicative function when generating feedback. Specifically:

1. **Limitations of existing research**:
   - Most modeling of feedback responses focuses on their timing and ignores how their lexical and prosodic form affects their contextual appropriateness and conversational function.
   - Existing Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) technologies mainly model main-channel speech and often neglect backchannels and human-like conversation.
2. **Research objectives**:
   - **Joint embedding learning**: Use contrastive learning to embed short dialogue contexts and feedback responses in the same representation space, thereby capturing the relationship between them.
   - **Evaluating the effectiveness of embeddings**: Verify whether these embeddings can serve as a measure of context-feedback appropriateness and be used to rank feedback responses.
   - **Exploring unsupervised learning**: Learn functional representations of feedback responses with unsupervised methods, without relying on manual labels.
3. **Practical applications**:
   - The model can help generate more natural, contextually appropriate feedback responses and thereby improve the interaction quality of spoken dialogue systems. This can be done either by directly ranking synthesized feedback candidates or by classifying the appropriate feedback function to guide the synthesis process.

### Specific problem description

- **How to generate context-appropriate feedback responses**: Jointly train the embedding representations of dialogue contexts and feedback responses so that the generated feedback is more natural and fits the dialogue situation.
- **How to evaluate the quality of feedback responses**: Compute the cosine similarity between the context embedding and the feedback embedding to judge whether a feedback response is appropriate.
- **How to handle multimodal information**: Combine audio and text information to capture the characteristics of feedback responses more completely. Because short feedback responses have limited lexical content, intonation is particularly important.

### Summary

Existing research largely ignores the form of feedback responses and their conversational function. This paper addresses that gap by using contrastive learning to embed dialogue contexts and feedback responses in the same representation space. The experimental results show that the model outperforms humans on the feedback response ranking task, especially when it uses both audio and text information. This offers a new approach to generating more natural and contextually appropriate feedback responses.
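The two mechanisms described in this summary, ranking feedback candidates by cosine similarity and training with a contrastive objective, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the InfoNCE-style form of the loss, the temperature value, and the toy three-dimensional vectors are all assumptions made for illustration, and the real model would obtain its embeddings from trained audio and text encoders.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def rank_feedback(
    context_emb: Sequence[float],
    candidates: List[Tuple[str, Sequence[float]]],
) -> List[Tuple[str, float]]:
    """Score each (label, embedding) candidate against the context, best first."""
    scored = [(label, cosine_similarity(context_emb, emb)) for label, emb in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def contrastive_loss(
    context_embs: Sequence[Sequence[float]],
    feedback_embs: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE-style loss (an assumed form of the objective).

    Row i of each batch is treated as a matching context-feedback pair;
    every other feedback in the batch acts as a negative for context i.
    """
    total = 0.0
    for i, ctx in enumerate(context_embs):
        logits = [cosine_similarity(ctx, fb) / temperature for fb in feedback_embs]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(context_embs)


if __name__ == "__main__":
    # Hypothetical embeddings; the labels are example feedback tokens only.
    context = [0.9, 0.1, 0.0]
    candidates = [
        ("mhm", [0.8, 0.2, 0.1]),
        ("wow", [0.0, 0.1, 0.9]),
        ("yeah", [0.5, 0.5, 0.0]),
    ]
    for label, score in rank_feedback(context, candidates):
        print(f"{label}: {score:.3f}")
```

In this sketch a lower loss means that matching pairs sit closer together than mismatched ones, which is what makes the cosine score usable as an appropriateness metric at ranking time.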

Joint Learning of Context and Feedback Embeddings in Spoken Dialogue
