ViMQ: A Vietnamese Medical Question Dataset for Healthcare Dialogue System Development

Ta Duc Huy, Nguyen Anh Tu, Tran Hoang Vu, Nguyen Phuc Minh, Nguyen Phan, Trung H. Bui, Steven Q. H. Truong
DOI: https://doi.org/10.1007/978-3-030-92310-5_76
2023-04-28
Abstract: Existing medical text datasets usually take the form of question and answer pairs that support the task of natural language generation, but they lack composite annotations of the medical terms. In this study, we publish a Vietnamese dataset of medical questions from patients with sentence-level and entity-level annotations for the Intent Classification and Named Entity Recognition tasks. The tag sets for the two tasks are in the medical domain and can facilitate the development of task-oriented healthcare chatbots with better comprehension of queries from patients. We train baseline models for the two tasks and propose a simple self-supervised training strategy with span-noise modelling that substantially improves the performance. Dataset and code will be published at https://github.com/tadeephuy/ViMQ
Computation and Language
What problem does this paper attempt to address?
This paper addresses a gap in existing medical text datasets: they typically take the form of question-answer pairs that support natural language generation, but they lack composite annotations of medical terms. To fill this gap, the authors release a Vietnamese medical question dataset (ViMQ) with sentence-level annotations for intent classification and entity-level annotations for named entity recognition. Both tag sets are specific to the medical domain and are intended to help task-oriented healthcare chatbots better understand patient queries. The main contributions of the paper are: 1. The release of the ViMQ dataset for the development of medical chatbots. 2. A self-supervised training strategy with span-noise modelling that improves performance on named entity recognition. The authors state that the dataset and code will be made publicly available on GitHub.
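
To make the two annotation levels concrete, the following is a minimal sketch of how one annotated question might be represented and converted to per-token BIO tags for NER training. Everything in it is an assumption for illustration: the `AnnotatedQuestion` class, the `to_bio` helper, the label names `ask_diagnosis` and `SYMPTOM`, and the English example sentence are hypothetical and are not taken from the ViMQ release, whose actual file format and tag sets may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AnnotatedQuestion:
    """Hypothetical container for one question (not the ViMQ schema)."""
    tokens: List[str]
    # Sentence-level annotation: a single intent label for the whole question.
    intent: str
    # Entity-level annotation: (start, end_exclusive, label) over token indices.
    entities: List[Tuple[int, int, str]]


def to_bio(sample: AnnotatedQuestion) -> List[str]:
    """Convert entity spans into one BIO tag per token."""
    tags = ["O"] * len(sample.tokens)
    for start, end, label in sample.entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Illustrative example only; the label names are assumed, not from the paper.
sample = AnnotatedQuestion(
    tokens=["I", "have", "had", "a", "headache", "for", "three", "days"],
    intent="ask_diagnosis",
    entities=[(4, 5, "SYMPTOM")],
)
bio_tags = to_bio(sample)
```

In this representation the intent label would serve as the target for the intent classification task, while the BIO tags would serve as the targets for the named entity recognition task.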