Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models

Ran Xu,Hejie Cui,Yue Yu,Xuan Kan,Wenqi Shi,Yuchen Zhuang,Wei Jin,Joyce Ho,Carl Yang

2023-11-01

Abstract: Clinical natural language processing requires methods that can address domain-specific challenges, such as complex medical terminology and clinical contexts. Recently, large language models (LLMs) have shown promise in this domain. Yet, their direct deployment can lead to privacy issues and is constrained by resources. To address this challenge, we delve into synthetic clinical text generation using LLMs for clinical NLP tasks. We propose an innovative, resource-efficient approach, ClinGen, which infuses knowledge into the process. Our model involves clinical knowledge extraction and context-informed LLM prompting. Both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation. Our extensive empirical study across 7 clinical NLP tasks and 16 datasets reveals that ClinGen consistently enhances performance across various tasks, effectively aligning the distribution of real datasets and significantly enriching the diversity of generated training instances. We will publish our code and all the generated data at https://github.com/ritaranx/ClinGen.

Computation and Language,Artificial Intelligence,Machine Learning,Quantitative Methods

What problem does this paper attempt to address?

This paper addresses the problem of generating high-quality synthetic clinical text data for Clinical Natural Language Processing (Clinical NLP), overcoming the limitations of existing methods in distribution consistency, diversity, privacy, and resource efficiency. Specifically:

1. **Distribution Consistency Problem**: The synthetic data generated by existing large language models (LLMs) often differs significantly in distribution from the real data (i.e., distribution shift). As a result, the generated data does not simulate the data distribution of actual application scenarios well, which hurts the performance of downstream tasks.
2. **Lack of Diversity**: The synthetic data generated by existing methods is clearly deficient in the number and frequency of entities and cannot fully reflect the complexity and diversity of real-world clinical data. This limits the effectiveness and practicality of the generated data.
3. **Privacy and Resource Efficiency Problems**: Applying large language models directly for inference brings high computational costs and privacy risks, especially when dealing with clinical texts containing sensitive patient information. A method that protects privacy and uses resources efficiently is therefore needed to generate synthetic data.

To solve these problems, the authors propose a framework named ClinGen. The framework generates high-quality and diverse synthetic clinical text data by integrating clinical knowledge into the prompting process, drawing on external knowledge graphs (KGs) and large language models (LLMs). This improves the quality of the generated data, brings its distribution closer to the real data, and enhances the diversity of generated instances. In addition, ClinGen depends on only a small amount of additional human input and can be applied to a wide range of core clinical NLP tasks.
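
The idea of infusing clinical topics and writing styles into the prompt can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the topic and style lists are hard-coded placeholders (in the described framework they would come from a knowledge graph and an LLM), the prompt wording is invented, and `build_prompt` and `sample_prompts` are hypothetical helper names, not functions from the authors' code.

```python
import random

# Hypothetical candidates. In the described framework these would be
# extracted from a knowledge graph or an LLM, not hard-coded.
TOPICS = ["hypertension", "type 2 diabetes", "asthma"]
STYLES = ["discharge summary", "clinical trial abstract", "patient forum question"]


def build_prompt(task, label, topic, style):
    """Compose one generation prompt from a task, a target label,
    a clinical topic, and a writing style (wording is illustrative)."""
    return (
        f"Suppose you need to create a dataset for {task}. "
        f"Write one example in the style of a {style} about {topic}. "
        f"The example should have the label: {label}."
    )


def sample_prompts(task, label, n, seed=0):
    """Draw n prompts, each with a randomly sampled topic and style,
    so that the generated instances vary in content and register."""
    rng = random.Random(seed)
    return [
        build_prompt(task, label, rng.choice(TOPICS), rng.choice(STYLES))
        for _ in range(n)
    ]


if __name__ == "__main__":
    for prompt in sample_prompts("disease mention detection", "positive", 3):
        print(prompt)
```

Sampling a fresh topic and style for every prompt is one simple way to push generated instances toward varied entities and registers; each resulting prompt would then be sent to an LLM of choice.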

### Formula Explanation

The paper does not involve specific mathematical formulas, but it mentions several evaluation metrics, such as Central Moment Discrepancy (CMD) and performance gain percentage. These metrics quantify the distribution difference between synthetic and real data and the improvement in model performance. For example:

- **CMD** measures the distribution gap between synthetic data and real data.
- **Performance Gain** represents the performance improvement of a model trained on the synthetic data generated by ClinGen relative to the baseline method.

These evaluation metrics help to verify the effectiveness and superiority of ClinGen.
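
As a rough illustration of the two metrics, the sketch below computes CMD between two sets of feature vectors using the commonly cited definition (difference of means plus differences of higher-order central moments, each scaled by the width of the value range), together with a relative performance gain. Several things are assumptions here: the features are taken to lie in an interval `[a, b]`, the moment order `k_max=5` is an arbitrary default, and the gain is computed relative to the baseline score. The summary above does not state how the paper embeds text or configures either metric.

```python
import math


def _column_means(rows):
    """Per-dimension mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def _central_moments(rows, means, k):
    """Per-dimension k-th central moment around the given means."""
    n = len(rows)
    return [
        sum((v - m) ** k for v in col) / n
        for col, m in zip(zip(*rows), means)
    ]


def _l2(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))


def cmd(x, y, k_max=5, a=0.0, b=1.0):
    """Central Moment Discrepancy between samples x and y.

    Assumes every feature lies in [a, b]; k_max is the highest
    moment order included (both are illustrative defaults).
    """
    span = abs(b - a)
    mx, my = _column_means(x), _column_means(y)
    total = _l2(mx, my) / span
    for k in range(2, k_max + 1):
        total += _l2(_central_moments(x, mx, k),
                     _central_moments(y, my, k)) / span ** k
    return total


def performance_gain_pct(new_score, baseline_score):
    """Relative improvement over a baseline, in percent (assumed definition)."""
    return 100.0 * (new_score - baseline_score) / baseline_score


if __name__ == "__main__":
    real = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.9]]
    synthetic = [[v + 0.1 for v in row] for row in real]
    print(cmd(real, synthetic))
    print(performance_gain_pct(0.55, 0.50))
```

A lower CMD means the synthetic features sit closer to the real ones; in practice `x` and `y` would be sentence embeddings of real and generated texts rather than the toy vectors used here.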

An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study

Improving Clinical NLP Performance through Language Model-Generated Synthetic Clinical Data

Natural Language Programming in Medicine: Administering Evidence Based Clinical Workflows with Autonomous Agents Powered by Generative Large Language Models

Retrieval-Reasoning Large Language Model-based Synthetic Clinical Trial Generation

Does Synthetic Data Generation of LLMs Help Clinical Text Mining?

A study of generative large language model for medical research and healthcare

Generative large language models are all-purpose text analytics engines: text-to-text learning is all your need

ChatGPT-HealthPrompt. Harnessing the Power of XAI in Prompt-Based Healthcare Decision Support using ChatGPT

KARGEN: Knowledge-enhanced Automated Radiology Report Generation Using Large Language Models

Improving Clinical Note Generation from Complex Doctor-Patient Conversation

Generative Large Language Models in Electronic Health Records for Patient Care Since 2023: A Systematic Review

Enhancing Small Medical Learners with Privacy-preserving Contextual Prompting

Improving Large Language Models for Clinical Named Entity Recognition via Prompt Engineering

Biomedical knowledge graph-optimized prompt generation for large language models

ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation

Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine

Large language models encode clinical knowledge

Leveraging A Medical Knowledge Graph into Large Language Models for Diagnosis Prediction