Abstract: With the development of natural language processing (NLP), automatic question-answering systems such as Watson, Siri, and Alexa have become one of the most important NLP applications. Enterprises now try to build automatic customer-service chatbots to save human resources and provide 24-hour customer service. Evaluation of chatbots currently relies heavily on human annotation, which costs a great deal of time. Thus, two new Short Text Conversation subtasks, Dialogue Quality (DQ) and Nugget Detection (ND), have been initiated, which aim to automatically evaluate dialogues generated by chatbots. In this paper, we address the DQ and ND subtasks with deep neural networks. We propose two models for the DQ and ND subtasks, both built on a hierarchical structure of an embedding layer, an utterance layer, a context layer, and a memory layer, which learn dialogue representations hierarchically at the word, sentence, context, and long-range context levels. Furthermore, we apply gating and attention mechanisms at the utterance and context layers to improve performance. We also try BERT as a replacement for the embedding and utterance layers to produce sentence representations. The results show that BERT produces better utterance representations than a multi-stack CNN for both the DQ and ND subtasks and outperforms models proposed in other research. The evaluation measures used for these subtasks, NMD and RSNOD for DQ and JSD and RNSS for ND, are not traditional evaluation measures such as accuracy, precision, recall, and F1-score. We therefore conduct a series of experiments with traditional evaluation measures and analyze the performance and errors.
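The abstract names the evaluation measures without defining them. As a minimal sketch, assuming the commonly used definitions of these distribution-comparison measures (Jensen-Shannon divergence for JSD, root normalized sum of squares for RNSS, and normalized match distance for NMD), three of them could be computed as below. The function names and the list-of-probabilities input format are illustrative assumptions, not taken from the paper, and RSNOD is omitted.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs (assumed definition)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def rnss(p, q):
    """Root normalized sum of squares: sqrt(sum((p_i - q_i)^2) / 2) (assumed definition)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)) / 2)


def nmd(p, q):
    """Normalized match distance over ordered classes (assumed definition).

    Sums the absolute differences of the cumulative distributions and
    divides by (number of classes - 1).
    """
    cum_p = cum_q = 0.0
    total = 0.0
    for a, b in zip(p, q):
        cum_p += a
        cum_q += b
        total += abs(cum_p - cum_q)
    return total / (len(p) - 1)
```

All three functions compare a predicted distribution with a gold distribution over the same labels and return 0 when the two are identical. Unlike JSD and RNSS, NMD depends on the order of the classes, which is why it suits ordinal quality scores.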

Naturalness Evaluation of Natural Language Generation in Task-oriented Dialogues using BERT

Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

Learning to Compare for Better Training and Evaluation of Open Domain Natural Language Generation Models

Boosting Naturalness of Language in Task-oriented Dialogues via Adversarial Training

Estimating Subjective Crowd-Evaluations as an Additional Objective to Improve Natural Language Generation

Task-Oriented Dialogue System as Natural Language Generation

Generating Persona Consistent Dialogues by Exploiting Natural Language Inference

Data-driven Natural Language Generation: Paving the Road to Success

On the Use of Linguistic Features for the Evaluation of Generative Dialogue Systems

LLM-based NLG Evaluation: Current Status and Challenges

Short Text Conversation Based on Deep Neural Network and Analysis on Evaluation Measures

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks

Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks

Investigating Human-Computer Interaction and Visual Comprehension in Text Generation Process of Natural Language Generation Models

Context-aware Natural Language Generation for Spoken Dialogue Systems

Uncertainty-aware Automatic Evaluation Method for Open-domain Dialogue Systems

A Survey of Natural Language Generation

Development of a Human Evaluation Framework and Correlation with Automated Metrics for Natural Language Generation of Medical Diagnoses

MENLI: Robust Evaluation Metrics from Natural Language Inference

Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach