Abstract: Developing conversational agents that interact with patients and provide primary clinical advice has attracted increasing attention because of its large application potential, especially during the COVID-19 pandemic. However, the training of end-to-end neural medical dialogue systems is restricted by the insufficient quantity of medical dialogue corpora. In this work, we make the first attempt to build and release a large-scale, high-quality medical dialogue dataset related to 12 types of common gastrointestinal diseases, named MedDG, with more than 17K conversations collected from an online health consultation community. Five categories of entities (diseases, symptoms, attributes, tests, and medicines) are annotated in each conversation of MedDG as additional labels. To push forward future research on building expert-sensitive medical dialogue systems, we propose two medical dialogue tasks based on the MedDG dataset: next entity prediction and doctor response generation. To gain a clear understanding of these two tasks, we implement several state-of-the-art benchmarks and design two dialogue models that further take the predicted entities into account. Experimental results show that pre-trained language models and other baselines struggle on both tasks, performing poorly on our dataset, and that response quality can be enhanced with the help of auxiliary entity information. In human evaluation, a simple retrieval model outperforms several state-of-the-art generative models, indicating that there remains considerable room for improvement in generating medically meaningful responses.

MedDialog: Two Large-scale Medical Dialogue Datasets

MedDialog: A Large-scale Medical Dialogue Dataset

OpenViDial 2.0: A Larger-Scale, Open-Domain Dialogue Generation Dataset with Visual Contexts

On the Generation of Medical Dialogues for COVID-19

MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware Medical Dialogue Generation

DialMed: A Dataset for Dialogue-based Medication Recommendation

CMCC: A Comprehensive and Large-Scale Human-Human Dataset for Dialogue Systems

MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation

A benchmark for automatic medical consultation system: frameworks, tasks and datasets

CDialog: A Multi-turn Covid-19 Conversation Dataset for Entity-Aware Dialog Generation

XDailyDialog: A Multilingual Parallel Dialogue Corpus

LCMDC: Large-scale Chinese Medical Dialogue Corpora for Automatic Triage and Medical Consultation

Medical Dialogue: A Survey of Categories, Methods, Evaluation and Challenges

MidMed: Towards Mixed-Type Dialogues for Medical Consultation

DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI

SuperDialseg: A Large-scale Dataset for Supervised Dialogue Segmentation

MediTOD: An English Dialogue Dataset for Medical History Taking with Comprehensive Annotations

A Large-Scale Chinese Short-Text Conversation Dataset

DiQAD: A Benchmark Dataset for End-to-End Open-domain Dialogue Assessment

Audio Dialogues: Dialogues dataset for audio and music understanding

CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation