Automation of Trainable Datasets Generation for Medical-Specific Language Model: Using MIMIC-IV Discharge Notes

Youngrong Lee, Chansik Kim, Taehoon Ko
DOI: https://doi.org/10.3233/SHTI240497
2024-08-22
Abstract: This study introduces a novel approach for automatically generating instruction datasets for fine-tuning medical-specialized language models using MIMIC-IV discharge records. The study created a large-scale text dataset in JSONL format, in which each entry comprises an instruction, a cropped discharge note as input, and an output. The dataset was generated in three main stages: instruction generation and output generation, both using seed tasks provided by medical experts, followed by filtering of invalid data. The generated dataset consisted of 51,385 sets, with a mean ROUGE score between seed tasks of 0.185. Evaluation of the generated dataset was promising, with high validity rates determined by both GPT-3.5 and a human annotator (88.0% and 88.5%, respectively). The study highlights the potential of automating dataset creation for NLP tasks in the medical domain.
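The record layout and filtering step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the field names (`instruction`, `input`, `output`), the helper names, the empty-field validity rule, and the choice of ROUGE-L F1 on whitespace tokens (the abstract does not say which ROUGE variant was used). The example strings are placeholders, not MIMIC-IV data.

```python
import json

FIELDS = ("instruction", "input", "output")  # assumed field names


def make_record(instruction, note_excerpt, output):
    """Bundle one example as an instruction / input / output triple."""
    return {"instruction": instruction, "input": note_excerpt, "output": output}


def is_valid(record):
    """Hypothetical filter: every field must be a non-empty string."""
    return all(isinstance(record.get(k), str) and record[k].strip() for k in FIELDS)


def to_jsonl(records):
    """Serialize the valid records, one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records if is_valid(r))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercased whitespace tokens (assumed variant)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    records = [
        make_record("Summarize the hospital course.", "[cropped note text]", "[summary]"),
        make_record("List the discharge medications.", "[cropped note text]", ""),
    ]
    print(to_jsonl(records))  # the second record is dropped by the filter
    print(rouge_l_f1("list the medications", "list the diagnoses"))
```

A low pairwise score between instructions, as computed by a function like `rouge_l_f1`, indicates that they share little wording, which is one way to read the reported mean of 0.185 as a diversity measure.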