Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models

Ran Xu, Hejie Cui, Yue Yu, Xuan Kan, Wenqi Shi, Yuchen Zhuang, Wei Jin, Joyce Ho, Carl Yang
2023-11-01
Abstract: Clinical natural language processing requires methods that can address domain-specific challenges, such as complex medical terminology and clinical contexts. Recently, large language models (LLMs) have shown promise in this domain. Yet, their direct deployment can lead to privacy issues and is constrained by resources. To address this challenge, we delve into synthetic clinical text generation using LLMs for clinical NLP tasks. We propose an innovative, resource-efficient approach, ClinGen, which infuses knowledge into the process. Our model involves clinical knowledge extraction and context-informed LLM prompting. Both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation. Our extensive empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks, effectively aligning the distribution of real datasets and significantly enriching the diversity of generated training instances. We will publish our code and all the generated data at https://github.com/ritaranx/ClinGen.
Computation and Language, Artificial Intelligence, Machine Learning, Quantitative Methods
What problem does this paper attempt to address?
This paper addresses how to generate high-quality synthetic clinical text for Clinical Natural Language Processing (Clinical NLP) while overcoming the limitations of existing methods in distribution consistency, diversity, privacy, and resource efficiency. Specifically:

1. **Distribution Consistency**: Synthetic data generated by existing large language models (LLMs) often differs significantly in distribution from real data (i.e., distribution shift). As a result, the generated data does not faithfully simulate the data seen in actual application scenarios, which hurts the performance of downstream tasks.
2. **Lack of Diversity**: Synthetic data generated by existing methods is clearly deficient in the number and frequency of entities and does not fully reflect the complexity and diversity of real-world clinical data. This limits the effectiveness and practicality of the generated data.
3. **Privacy and Resource Efficiency**: Applying LLMs directly for inference brings high computational costs and privacy risks, especially when dealing with clinical texts that contain sensitive patient information. A method is therefore needed that generates synthetic data while protecting privacy and using resources efficiently.

To solve these problems, the authors propose a framework named ClinGen. It generates high-quality and diverse synthetic clinical text by integrating clinical knowledge into the prompting process, drawing on external knowledge graphs (KGs) and LLMs. This improves the quality of the generated data, brings its distribution closer to that of real data, and increases the diversity of generated instances. In addition, ClinGen depends on only a small amount of additional human input and can be applied to a wide range of core clinical NLP tasks.
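
As a rough illustration of the knowledge-infused prompting idea described above, the sketch below samples a clinical topic and a writing style and places both in a generation prompt. It is not the authors' implementation: the function names, the prompt wording, and the example topics and styles are all hypothetical, and the source does not give the actual prompt templates or how the knowledge graph is queried.

```python
import random


def build_prompt(task, label, topic, style):
    """Compose one generation prompt from a sampled topic and writing style.

    The wording is a hypothetical placeholder, not the paper's template.
    """
    return (
        f"Write a synthetic clinical text for the task '{task}' "
        f"with label '{label}'. "
        f"Topic: {topic}. Writing style: {style}."
    )


def sample_prompts(task, label, topics, styles, n, seed=0):
    """Draw n prompts, each with a randomly chosen topic and style.

    `topics` stands in for entities taken from a knowledge graph or an LLM,
    and `styles` for LLM-suggested writing styles (both assumed inputs).
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    return [
        build_prompt(task, label, rng.choice(topics), rng.choice(styles))
        for _ in range(n)
    ]


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    example_topics = ["hypertension", "type 2 diabetes", "asthma"]
    example_styles = ["discharge summary", "research abstract"]
    for prompt in sample_prompts(
        "disease mention classification", "positive",
        example_topics, example_styles, n=2,
    ):
        print(prompt)
```

Varying the topic and style per prompt is one plausible way such a scheme could increase the diversity of generated instances, which is the effect the summary attributes to ClinGen.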
### Formula Explanation
The paper involves no specific mathematical formulas, but it mentions several evaluation metrics, such as Central Moment Discrepancy (CMD) and percentage performance gain. These metrics quantify the distribution difference between synthetic and real data and the improvement in model performance. For example:
- **CMD** measures the distribution gap between synthetic data and real data.
- **Performance Gain** is the performance improvement of a model trained on synthetic data generated by ClinGen relative to the baseline method.

These evaluation metrics help verify the effectiveness of ClinGen.
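
To make the two metrics concrete, here is a minimal pure-Python sketch. It follows the commonly used definition of CMD: the L2 distance between the feature means plus the L2 distances between central moments of order 2 to K, each scaled by the feature value range. The function names, the default order `max_order=5`, the assumed feature bounds `[lower, upper]`, and the relative-gain formula are assumptions for illustration; the source does not state how the paper computes features or gains.

```python
import math


def _mean(vectors):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def _central_moment(vectors, mean, k):
    """Column-wise k-th central moment, E[(x - mean)^k]."""
    n = len(vectors)
    return [
        sum((x - m) ** k for x in col) / n
        for col, m in zip(zip(*vectors), mean)
    ]


def _l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cmd(real, synthetic, max_order=5, lower=0.0, upper=1.0):
    """Central Moment Discrepancy between two sets of feature vectors.

    Features are assumed to lie in [lower, upper]; lower values of the
    result indicate closer distributions.
    """
    span = abs(upper - lower)
    mean_r, mean_s = _mean(real), _mean(synthetic)
    total = _l2(mean_r, mean_s) / span
    for k in range(2, max_order + 1):
        moment_r = _central_moment(real, mean_r, k)
        moment_s = _central_moment(synthetic, mean_s, k)
        total += _l2(moment_r, moment_s) / span ** k
    return total


def relative_gain_percent(score, baseline_score):
    """Percentage improvement of `score` over `baseline_score` (assumed reading)."""
    return 100.0 * (score - baseline_score) / baseline_score
```

In practice the feature vectors would typically be text embeddings of real and synthetic examples; which encoder to use is left open here.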