Abstract: Objective: Traditional knowledge-based and machine learning diagnostic decision support systems have benefited from integrating the medical domain knowledge encoded in the Unified Medical Language System (UMLS). The emergence of Large Language Models (LLMs) as potential replacements for these traditional systems raises questions about the quality and extent of the medical knowledge in the models' internal representations and about the need for external knowledge sources. The objective of this study is three-fold: to probe the diagnosis-related medical knowledge of popular LLMs, to examine the benefit of providing UMLS knowledge to LLMs (grounding the diagnosis predictions), and to evaluate the correlation between human judgments and UMLS-based metrics for LLM generations.
Methods: We evaluated diagnoses generated by LLMs from consumer health questions and from daily care notes in electronic health records, using the ConsumerQA and Problem Summarization datasets. We probed LLMs for UMLS knowledge by prompting them to complete diagnosis-related UMLS knowledge paths. We examined grounding with an approach that integrated UMLS graph paths and clinical notes into the LLM prompts, and compared the results to prompting without the UMLS paths. The final experiments examined how well different evaluation metrics, UMLS-based and non-UMLS, align with human expert evaluation.
Results: In probing UMLS knowledge, GPT-3.5 significantly outperformed Llama2 and a simple baseline, yielding an F1 score of 10.9% in completing one-hop UMLS paths for a given concept. Grounding diagnosis predictions with the UMLS paths improved the results for both models on both tasks, with the highest improvement (4%) in SapBERT score. There was a weak correlation between the widely used evaluation metrics (ROUGE and SapBERT) and human judgments.
Conclusion: We found that while popular LLMs contain some medical knowledge in their internal representations, augmentation with UMLS knowledge provides performance gains in diagnosis generation. The UMLS needs to be tailored to the task to improve the LLMs' predictions. Finding evaluation metrics that align with human judgments better than the traditional ROUGE and BERT-based scores remains an open research question.
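
The one-hop probing result above is reported as an F1 score. As a minimal sketch of how such a score could be computed, the snippet below compares the concepts an LLM proposes as one-hop neighbours of a query concept against the neighbours recorded in the knowledge graph. The abstract does not describe the matching procedure, so exact string matching after lowercasing is an assumption here, and the function name `set_f1` and the example concept lists are hypothetical.

```python
from typing import Iterable, Tuple


def set_f1(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over two sets of concept names.

    Assumption: names match if they are equal after stripping whitespace
    and lowercasing. A real evaluation might match on concept identifiers.
    """
    pred = {p.strip().lower() for p in predicted if p.strip()}
    ref = {g.strip().lower() for g in gold if g.strip()}
    true_positives = len(pred & ref)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0, 0.0, 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: graph neighbours of a query concept vs. LLM completions.
gold_neighbours = ["Cough", "Fever", "Dyspnea", "Chest pain"]
llm_completions = ["fever", "cough", "Headache"]

precision, recall, f1 = set_f1(llm_completions, gold_neighbours)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this example two of the three completions appear among the four gold neighbours, so precision is 2/3 and recall is 1/2. Scores for many query concepts would then need to be aggregated, for instance by averaging; the abstract does not state which aggregation was used.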

Leveraging A Medical Knowledge Graph into Large Language Models for Diagnosis Prediction

Large Language Models and Medical Knowledge Grounding for Diagnosis Prediction

Large Language Models for Biomedical Knowledge Graph Construction: Information extraction from EMR notes

medIKAL: Integrating Knowledge Graphs as Assistants of LLMs for Enhanced Clinical Diagnosis on EMRs

Think and Retrieval: A Hypothesis Knowledge Graph Enhanced Medical Large Language Models

Critical Care Studies Using Large Language Models Based on Electronic Healthcare Records: A Technical Note

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Integrating Automated Knowledge Extraction with Large Language Models for Explainable Medical Decision-Making

Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval

On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models

Leveraging Medical Knowledge Graphs and Large Language Models for Enhanced Mental Disorder Information Extraction

MedKP: Medical Dialogue with Knowledge Enhancement and Clinical Pathway Encoding

KARGEN: Knowledge-enhanced Automated Radiology Report Generation Using Large Language Models

Integrated Application of LLM Model and Knowledge Graph in Medical Text Mining and Knowledge Extraction

Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries

Large Language Models with Retrieval-Augmented Generation for Zero-Shot Disease Phenotyping

Augmented non-hallucinating large language models as medical information curators

KG-Rank: Enhancing Large Language Models for Medical QA with Knowledge Graphs and Ranking Techniques

Demystifying Large Language Models for Medicine: A Primer