A Comprehensive Data Preprocessing Framework Towards Improving Internet Chinese Medical Data Quality

Chong Zhang, Yibing Zhan, Yunzhou Zhong, Jun Ni, Jianqing Zhu, Changtong Zan, Dapeng Tao
DOI: https://doi.org/10.1109/iccea62105.2024.10603802
2024-01-01
Abstract: Medical large language models (MLLMs) have attracted increasing attention recently. Data is the key to building MLLMs, and the most common approach is to collect it from online healthcare platforms. Raw Internet data contains various types of noise; however, current data preprocessing methods are costly, incomplete, and ineffective. In this paper, we propose a comprehensive data preprocessing framework that reduces the noise in Internet data as much as possible. Specifically, our framework divides noise into four categories (chaotic data format, low data quality, data duplication, and personal privacy) and designs one module to reduce each type of noise. First, all data passes through a data unification module so that subsequent processing operates on a stable data form. Then, a quality filtering module uses keyword matching, text statistics, and metric features to detect and eliminate low-quality elements. Subsequently, a data deduplication module removes redundancy at the text level and the line level, alleviating potential interference with model training. Lastly, personal identity information is removed to protect user privacy. To validate the usefulness of our framework, we use the MedDialog-CN dataset, a typical Internet Chinese medical dataset, as a testbed with three typical language models: BERT-GPT, DialoGPT, and Transformer. According to automatic and manual evaluation, our framework filters out 26.84% of the data in MedDialog-CN as noisy, and the performance of all three models improves when our framework is used.
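The four-stage flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the keyword list, the length threshold, the phone-number pattern, and the reading of "text level" as whole-dialogue deduplication are all assumptions made for illustration, and the "metric features" mentioned in the abstract are omitted because the abstract does not specify them.

```python
import re
import unicodedata

# Hypothetical settings: the abstract does not list the paper's keywords or thresholds.
SPAM_KEYWORDS = ("广告", "点击链接")
MIN_CHARS = 5
# Assumed pattern for an 11-digit mainland China mobile number.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def unify(text):
    """Data unification: NFKC-normalise characters and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def is_low_quality(text):
    """Quality filtering via a text statistic (length) and keyword matching."""
    if len(text) < MIN_CHARS:
        return True
    return any(keyword in text for keyword in SPAM_KEYWORDS)


def dedup_lines(lines):
    """Line-level deduplication: keep the first occurrence of each utterance."""
    seen = set()
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


def mask_pii(text):
    """Privacy step: replace phone-like numbers with a placeholder."""
    return PHONE_RE.sub("[PHONE]", text)


def preprocess(dialogues):
    """Run the four stages over a list of dialogues (each a list of utterances)."""
    seen_dialogues = set()
    cleaned = []
    for dialogue in dialogues:
        lines = [unify(line) for line in dialogue]
        lines = [line for line in lines if not is_low_quality(line)]
        lines = dedup_lines(lines)
        lines = [mask_pii(line) for line in lines]
        key = tuple(lines)
        # Text-level deduplication (assumed here to mean exact whole-dialogue matches).
        if not lines or key in seen_dialogues:
            continue
        seen_dialogues.add(key)
        cleaned.append(lines)
    return cleaned
```

Calling `preprocess` on a list of dialogues returns only the dialogues that survive all four stages. Running unification first means the later stages compare normalised strings, which is the ordering the abstract motivates.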