LCMDC: Large-scale Chinese Medical Dialogue Corpora for Automatic Triage and Medical Consultation

Xinyuan Wang, Haozhou Li, Dingfang Zheng, Qinke Peng
2024-09-27
Abstract: The global COVID-19 pandemic underscored major deficiencies in traditional healthcare systems, hastening the advancement of online medical services, especially in medical triage and consultation. However, existing studies face two main challenges. First, the scarcity of large-scale, publicly available, domain-specific medical datasets due to privacy concerns, with current datasets being small and limited to a few diseases, limiting the effectiveness of triage methods based on Pre-trained Language Models (PLMs). Second, existing methods lack medical knowledge and struggle to accurately understand professional terms and expressions in patient-doctor consultations. To overcome these obstacles, we construct the Large-scale Chinese Medical Dialogue Corpora (LCMDC), comprising a Coarse-grained Triage dataset with 439,630 samples, a Fine-grained Diagnosis dataset with 199,600 samples, and a Medical Consultation dataset with 472,418 items, thereby addressing the data shortage in this field. Moreover, we further propose a novel triage system that combines BERT-based supervised learning with prompt learning, as well as a GPT-based medical consultation model using reinforcement learning. To enhance domain knowledge acquisition, we pre-trained PLMs using our self-constructed background corpus. Experimental results on the LCMDC demonstrate the efficacy of our proposed systems.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address two major challenges faced in online medical services for triage and consultation:

1. **Lack of large-scale public medical datasets**:
   - Due to privacy issues, it is very difficult to obtain large-scale, public datasets for medical triage and consultation. Existing datasets are small in scale and cover only a few diseases, which limits the effectiveness of triage methods based on pre-trained language models (PLMs).
2. **Lack of medical knowledge in existing methods**:
   - Existing methods have difficulty understanding and processing professional terms and expressions in patient-doctor dialogues because they lack domain-specific knowledge.

To overcome these obstacles, the authors constructed a large-scale Chinese medical dialogue corpus (LCMDC) and proposed a new triage system combining supervised learning and prompt learning, as well as a GPT-2-based medical consultation model using reinforcement learning. They pre-trained the language model with a self-built background corpus to enhance the acquisition of domain knowledge.

### Main Contributions

1. **Construction of a large-scale Chinese medical dialogue corpus**:
   - Includes a coarse-grained triage dataset (439,630 samples), a fine-grained diagnosis dataset (199,600 samples), and a medical consultation dataset (472,418 samples), providing a foundation for automatic triage and medical consultation systems.
2. **Proposal of an information fusion classification algorithm and a prompt learning classification algorithm**:
   - By combining supervised learning and prompt learning methods, the accuracy of intelligent triage is improved, especially for the classification of rare diseases, with an accuracy increase of 5%.
3. **Development of a dialogue model construction framework**:
   - Includes knowledge fusion and input supplementation, enhancing the performance of multiple evaluation metrics.

### Dataset Construction

1. **Data Collection**:
   - Crawled millions of doctor-patient dialogue records from the online medical consultation platform "Quick Doctor" to construct LCMDC.
2. **Intelligent Triage Dataset**:
   - Includes a coarse-grained triage dataset (14 categories, 439,630 samples) and a fine-grained diagnosis dataset (120 categories, 199,600 samples) for automatic medical triage and diagnosis.
3. **Medical Consultation Dataset**:
   - Includes 472,418 question-answer pairs for medical dialogue generation tasks.

### Intelligent Triage System

1. **Supervised Learning Triage System**:
   - Uses pre-trained language models (such as BERT) for supervised learning, integrates information through bidirectional LSTM networks and dendritic networks, and finally obtains the classification distribution through a single-layer perceptron and Softmax function.
2. **Prompt Learning Triage System**:
   - Utilizes the knowledge in pre-trained language models to improve classification performance on small-sample datasets through prompt learning methods.

### Medical Consultation System

1. **Problem Definition**:
   - Models the construction of the consultation system as a text generation task, generating doctor responses based on input questions and a knowledge graph.
2. **System Overview**:
   - Based on the GPT-2 model, retrains with medical encyclopedia and consultation data to internalize knowledge. Constructs a medical knowledge graph to provide external information as supplementary input. Finally, trains the model with dialogue data and fine-tunes it through reinforcement learning.

### Experimental Results

1. **Supervised Learning Triage System Experiments**:
   - The proposed supervised learning text classification algorithm outperforms machine learning and deep neural network methods on multiple evaluation metrics, especially on the fine-grained dataset.
2. **Prompt Learning Triage System Experiments**:
   - The prompt learning method outperforms supervised learning in classification performance on small-sample datasets but performs slightly worse on large-scale datasets.
3. **Medical Consultation System Experiments**:
   - The proposed model outperforms traditional deep learning sequence-to-sequence models and the BART model on multiple evaluation metrics, especially in terms of TER and BLEU scores.

### Conclusion

This paper effectively addresses the issues of data scarcity and lack of domain knowledge in online medical services by constructing a large-scale Chinese medical dialogue corpus and proposing new triage and consultation systems, providing strong support for intelligent medical triage and dialogue generation.
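
### Illustrative Sketch: Classification Head

The last step of the supervised triage system described above (a single-layer perceptron followed by a Softmax function) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature vector stands in for the fused output of the BERT, bidirectional LSTM, and dendritic network stages, and the function names, weights, and three department labels are all hypothetical (the coarse-grained dataset actually has 14 categories).

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: shift by the max logit before exponentiating."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def perceptron_head(features: List[float],
                    weights: List[List[float]],
                    bias: List[float]) -> List[float]:
    """Single linear layer (one weight row per class) followed by softmax."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical department labels; the paper's real label set is not listed here.
departments = ["Internal Medicine", "Surgery", "Pediatrics"]

# Hypothetical fused feature vector standing in for the upstream encoder output.
features = [0.5, -1.0, 2.0]

# Hypothetical weights: an identity matrix, so each logit equals one feature.
weights = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]
bias = [0.0, 0.0, 0.0]

probs = perceptron_head(features, weights, bias)
predicted = departments[max(range(len(probs)), key=probs.__getitem__)]
```

With these toy values the logits equal the features, so the third class receives the highest probability and `predicted` is `"Pediatrics"`. In a trained system the weights and bias would be learned jointly with the encoder rather than set by hand.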