Joint Learning of Context and Feedback Embeddings in Spoken Dialogue

Livia Qian, Gabriel Skantze
2024-06-11
Abstract: Short feedback responses, such as backchannels, play an important role in spoken dialogue. So far, most of the modeling of feedback responses has focused on their timing, often neglecting how their lexical and prosodic form influence their contextual appropriateness and conversational function. In this paper, we investigate the possibility of embedding short dialogue contexts and feedback responses in the same representation space using a contrastive learning objective. In our evaluation, we primarily focus on how such embeddings can be used as a context-feedback appropriateness metric and thus for feedback response ranking in U.S. English dialogues. Our results show that the model outperforms humans given the same ranking task and that the learned embeddings carry information about the conversational function of feedback responses.
Computation and Language, Artificial Intelligence, Machine Learning
### What problems does this paper attempt to solve?

This paper addresses the modeling of feedback responses in spoken dialogue systems (SDS), in particular how to take the dialogue context into account and convey the appropriate communicative function when generating feedback. Specifically:

1. **Limitations of existing research**:
   - Most modeling of feedback responses has focused on their timing, neglecting how their lexical and prosodic form affects their contextual appropriateness and conversational function.
   - Existing Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) technologies mainly model main-channel speech and often ignore backchannels and other aspects of human-like conversation.

2. **Research objectives**:
   - **Joint embedding learning**: Use contrastive learning to embed short dialogue contexts and feedback responses in the same representation space, capturing the relationship between them.
   - **Evaluating the effectiveness of embeddings**: Verify whether these embeddings can serve as a measure of context-feedback appropriateness and be used to rank feedback responses.
   - **Exploring unsupervised learning**: Learn functional representations of feedback responses without relying on manually annotated labels.

3. **Practical applications**:
   - The model can help generate more natural, context-appropriate feedback responses, improving the interaction quality of spoken dialogue systems. This can be done either by directly ranking synthesized feedback candidates or by classifying the appropriate feedback function to guide the synthesis process.

### Specific problem description

- **How to generate context-appropriate feedback responses**: Jointly train the embedding representations of dialogue contexts and feedback responses so that the generated feedback is more natural and fits the dialogue situation.
- **How to evaluate the quality of feedback responses**: Compute the cosine similarity between the context embedding and the feedback embedding to assess whether a feedback response is appropriate.
- **How to handle multimodal information**: Combine audio and text information to capture the characteristics of feedback responses more fully. Because short feedback responses have limited lexical content, intonation is particularly important.

### Summary

Existing research has largely ignored the form of feedback responses and their conversational functions. This paper addresses that gap by using contrastive learning to embed dialogue contexts and feedback responses in the same representation space. The experimental results show that the model outperforms humans on the feedback response ranking task, especially when both audio and text information are used. This offers a new approach to generating more natural, context-appropriate feedback responses.
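The cosine-similarity ranking and the contrastive objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`cosine_similarity`, `rank_feedback`, `contrastive_loss`), the toy 2-D embeddings, and the `temperature` value are all hypothetical. The loss is an InfoNCE-style formulation, chosen here as one common contrastive objective; the summary does not state which exact loss the paper uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_feedback(context_emb, candidates):
    """Sort candidate feedback labels from most to least similar to the context.

    `candidates` maps a label (hypothetical, e.g. "mhm") to its embedding.
    """
    scored = [(cosine_similarity(context_emb, emb), label)
              for label, emb in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored]


def contrastive_loss(context_embs, feedback_embs, temperature=0.1):
    """InfoNCE-style loss (an assumption, not confirmed by the source).

    Context i is treated as the positive match for feedback i; every other
    feedback in the batch acts as a negative.
    """
    n = len(context_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine_similarity(context_embs[i], feedback_embs[j]) / temperature
                  for j in range(n)]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / n


# Toy example with made-up 2-D embeddings.
context = [1.0, 0.0]
candidates = {"mhm": [0.9, 0.1], "wow": [0.0, 1.0], "no": [-1.0, 0.0]}
print(rank_feedback(context, candidates))  # ['mhm', 'wow', 'no']
```

In a real system the embeddings would come from trained context and feedback encoders over audio and text. The ranking step itself only needs the similarity score, which is why the same embeddings can double as an appropriateness metric.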