Abstract: With the development of natural language processing (NLP), automatic question-answering systems such as Watson, Siri, and Alexa have become one of the most important NLP applications. Enterprises now try to build automatic customer-service chatbots to save human resources and provide 24-hour customer service. Evaluation of chatbots currently relies heavily on human annotation, which costs a great deal of time. Thus, new Short Text Conversation subtasks, Dialogue Quality (DQ) and Nugget Detection (ND), have been initiated, which aim to automatically evaluate dialogues generated by chatbots. In this paper, we address the DQ and ND subtasks with deep neural networks. We propose two models for both the DQ and ND subtasks, each built as a hierarchy of an embedding layer, an utterance layer, a context layer, and a memory layer, to hierarchically learn dialogue representations at the word level, sentence level, context level, and long-range context level. Furthermore, we apply gating and attention mechanisms at the utterance layer and context layer to improve performance. We also tried BERT as a replacement for the embedding layer and utterance layer to produce sentence representations. The results show that BERT produces better utterance representations than a multi-stack CNN for both the DQ and ND subtasks and outperforms models proposed by other researchers. The evaluation measures proposed for these subtasks, namely NMD and RSNOD for DQ and JSD and RNSS for ND, are not traditional evaluation measures such as accuracy, precision, recall, and F1-score. We therefore conduct a series of experiments using traditional evaluation measures and analyze the performance and errors.
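The measures named in the abstract compare a predicted probability distribution with a gold distribution instead of counting exact label matches. The following is a minimal sketch of three of them (JSD, RNSS, and NMD), assuming their commonly cited definitions: Jensen-Shannon divergence in base 2, root normalised sum of squares, and normalised match distance over ordered bins. The function names and toy distributions are illustrative and are not taken from the paper; RSNOD is omitted.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return (kl(p, m) + kl(q, m)) / 2


def rnss(p, q):
    """Root normalised sum of squares (assumed definition): sqrt(SS / 2)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)) / 2)


def nmd(p, q):
    """Normalised match distance (assumed definition): summed absolute
    difference of the cumulative distributions, divided by (bins - 1)."""
    cum_p = cum_q = total = 0.0
    for a, b in zip(p, q):
        cum_p += a
        cum_q += b
        total += abs(cum_p - cum_q)
    return total / (len(p) - 1)


# Toy example: gold mass in the first bin, predictions one or two bins away.
gold = [1.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0]
far = [0.0, 0.0, 1.0]
```

Under these assumed definitions, `rnss` scores `near` and `far` identically because it compares bins independently, while `nmd` penalises `far` more because it accounts for the order of the bins. This is one reason an order-aware measure suits ordinal quality scores, while bin-by-bin measures suit nominal nugget labels.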

Towards Automatic Evaluation of Customer-Helpdesk Dialogues

Test Collections and Measures for Evaluating Customer-Helpdesk Dialogues

Short Text Conversation Based on Deep Neural Network and Analysis on Evaluation Measures

Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling

Turn-level Dialog Evaluation with Dialog-level Weak Signals for Bot-Human Hybrid Customer Service Systems

Multi-dimensional Evaluation of Empathetic Dialog Responses

Simulating User Satisfaction for the Evaluation of Task-oriented Dialogue Systems

DialogQAE: N-to-N Question Answer Pair Extraction from Customer Service Chatlog

The First Evaluation of Chinese Human-Computer Dialogue Technology

An Evaluation of Chinese Human-Computer Dialogue Technology

Toward More Accurate and Generalizable Evaluation Metrics for Task-Oriented Dialogs

How to Evaluate the Next System: Automatic Dialogue Evaluation from the Perspective of Continual Learning

FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation

Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores from Turn-level Scores

Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

BotEval: Facilitating Interactive Human Evaluation

ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents