Abstract: The dialogue systems in customer services have been developed with neural models to provide users with precise answers and round-the-clock support in task-oriented conversations by detecting customer intents based on their utterances. Existing intent detection approaches have highly relied on adaptively pre-training language models with large-scale datasets, yet the predominant cost of data collection may hinder their superiority. In addition, they neglect the information within the conversational responses of the agents, which have a lower collection cost, but are significant to customer intent as agents must tailor their replies based on the customers' intent. In this paper, we propose RSVP, a self-supervised framework dedicated to task-oriented dialogues, which utilizes agent responses for pre-training in a two-stage manner. Specifically, we introduce two pre-training tasks to incorporate the relations of utterance-response pairs: 1) Response Retrieval by selecting a correct response from a batch of candidates, and 2) Response Generation by mimicking agents to generate the response to a given utterance. Our benchmark results for two real-world customer service datasets show that RSVP significantly outperforms the state-of-the-art baselines by 4.95% for accuracy, 3.4% for MRR@3, and 2.75% for MRR@5 on average. Extensive case studies are investigated to show the validity of incorporating agent responses into the pre-training stage.
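
The abstract reports accuracy, MRR@3, and MRR@5. As a rough illustration of how such metrics are commonly computed, the sketch below assumes each example comes with a ranked list of predicted intents and a single gold intent. The function names `accuracy` and `mrr_at_k` and the toy intent labels are hypothetical and are not taken from the paper.

```python
def accuracy(ranked_predictions, gold_labels):
    """Fraction of examples whose top-ranked prediction equals the gold label."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_labels)
        if ranked and ranked[0] == gold
    )
    return hits / len(gold_labels)


def mrr_at_k(ranked_predictions, gold_labels, k):
    """Mean reciprocal rank; a gold label outside the top k contributes 0."""
    total = 0.0
    for ranked, gold in zip(ranked_predictions, gold_labels):
        for rank, label in enumerate(ranked[:k], start=1):
            if label == gold:
                total += 1.0 / rank
                break
    return total / len(gold_labels)


# Toy data (hypothetical intents, not from the paper's datasets).
ranked = [
    ["refund", "cancel_order", "track_order"],  # gold at rank 1
    ["track_order", "refund", "cancel_order"],  # gold at rank 2
    ["cancel_order", "track_order", "refund"],  # gold at rank 3
]
gold = ["refund", "refund", "refund"]

print(accuracy(ranked, gold))
print(mrr_at_k(ranked, gold, k=3))
print(mrr_at_k(ranked, gold, k=2))
```

Under this definition, accuracy only rewards a correct top prediction, while MRR@k gives partial credit when the gold intent appears lower in the top k.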

What problem does this paper attempt to address?

The paper addresses customer intent detection in task-oriented dialogue systems, particularly in customer service scenarios. It proposes a framework called RSVP (Request and Service Via Pre-training) that improves the recognition of customer intent by leveraging the responses of customer service agents.

Existing intent detection methods typically rely on pre-training with large-scale, high-quality annotated data, which is costly and time-consuming to collect. Moreover, most studies focus solely on the customer's utterance and neglect the agent's response. The paper points out that the agent's response carries important clues about the customer's intent, because agents must tailor their replies to that intent, and that these responses are easier to collect and require no additional annotation.

RSVP therefore exploits agent responses in two stages:

1. **Pre-training Stage**:
   - **Response Retrieval**: Select the correct agent response from a batch of candidates, which trains the model to distinguish correct from incorrect responses.
   - **Response Generation**: Mimic the agent by generating the response to a given customer utterance, so the model learns directly how utterances map to replies.
2. **Fine-tuning Stage**: Apply the pre-trained model to the intent detection task itself to further optimize performance.

In this way, RSVP reduces the dependency on large external annotated datasets and makes use of the metadata already present in the internal customer service dialogue system (i.e., the agent's responses), thereby improving the accuracy of customer intent detection. Experiments on two real-world customer service datasets show that RSVP outperforms the state-of-the-art baselines by 4.95% in accuracy, 3.4% in MRR@3, and 2.75% in MRR@5 on average.
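
The two pre-training tasks can be illustrated with a minimal, dependency-free sketch. It assumes that utterances and responses have already been encoded as vectors, that retrieval is trained with a generic in-batch softmax contrastive loss, and that generation is trained with token-level negative log-likelihood. The summary does not specify the paper's encoder, exact objectives, temperature, or how the losses are combined, so the functions `in_batch_retrieval_loss` and `generation_loss` and all toy values are assumptions for illustration only.

```python
import math


def dot(u, v):
    """Dot product used as the utterance-response similarity score."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_retrieval_loss(utterance_vecs, response_vecs, temperature=1.0):
    """Average cross-entropy where response i is the positive for utterance i
    and every other response in the batch acts as a negative."""
    losses = []
    for i, u in enumerate(utterance_vecs):
        scores = [dot(u, r) / temperature for r in response_vecs]
        top = max(scores)  # subtract the max for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)


def generation_loss(gold_token_probs):
    """Mean negative log-likelihood of the agent's response tokens, given the
    probability the model assigned to each gold token."""
    return -sum(math.log(p) for p in gold_token_probs) / len(gold_token_probs)


# Toy embeddings (hypothetical): each utterance matches the response at the same index.
utterances = [[1.0, 0.0], [0.0, 1.0]]
aligned_responses = [[1.0, 0.0], [0.0, 1.0]]
shuffled_responses = [[0.0, 1.0], [1.0, 0.0]]

print(in_batch_retrieval_loss(utterances, aligned_responses))
print(in_batch_retrieval_loss(utterances, shuffled_responses))
print(generation_loss([0.5, 0.5]))
```

The retrieval loss is lower when each utterance is most similar to its own response than when the pairs are shuffled, which is the behavior the Response Retrieval task is meant to encourage. The generation loss falls as the model assigns higher probability to the tokens the agent actually wrote.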

RSVP: Customer Intent Detection via Agent Response Contrastive and Generative Pre-Training
