Large Language Models and Medical Knowledge Grounding for Diagnosis Prediction

Yanjun Gao, Ruizhe Li, Emma Croxford, Samuel Tesch, Daniel To, John Caskey, Brian W Patterson, Matthew M Churpek, Timothy Miller, Dmitriy Dligach, Majid Afshar
DOI: https://doi.org/10.1101/2023.11.24.23298641
2024-06-18
Abstract: While large language models (LLMs) have showcased their potential in diverse language tasks, their application in the healthcare arena needs to ensure the minimization of diagnostic errors and the prevention of patient harm. A medical knowledge graph (KG) houses a wealth of structured medical concept relations sourced from authoritative references, such as UMLS, making it a valuable resource to ground LLM diagnostic process in knowledge. In this paper, we examine the synergistic potential of LLMs and medical KG in predicting diagnoses given electronic health records (EHR), under the framework of Retrieval-augmented generation (RAG). We proposed a novel graph model: DR.KNOWS, that selects the most relevant pathology knowledge paths based on the medical problem descriptions. In order to evaluate DR.KNOWS, we developed the first comprehensive human evaluation approach to assess the performance of LLMs for diagnosis prediction and examine the rationale behind their decision-making processes, aimed at improving diagnostic safety. Using real-world hospital datasets, our study serves to enrich the discourse on the role of medical KGs in grounding medical knowledge into LLMs, revealing both challenges and opportunities in harnessing external knowledge for explainable diagnostic pathway and the realization of AI-augmented diagnostic decision support systems.
Health Informatics
What problem does this paper attempt to address?
The paper primarily aims to address the following issues:

### Core Issues

- **Improving Diagnostic Accuracy and Reliability**: When large language models (LLMs) are used for automatic disease diagnosis, how to ensure that the diagnostic results are accurate and reliable enough to avoid potentially life-threatening errors.

### Solution Exploration

- **Integrating Medical Knowledge Graphs**: Investigating how to integrate medical knowledge graphs (KGs) into LLMs to improve performance on diagnosis generation and make the output more interpretable.
- **Developing New Graph Models**: Proposing a new graph model, DR.KNOWS, that identifies the most relevant pathology knowledge paths in authoritative sources such as the Unified Medical Language System (UMLS).
- **Evaluation Framework Design**: Designing the first comprehensive human evaluation framework to assess how well LLMs predict diagnoses and to examine the rationale behind their decisions, with a particular focus on diagnostic safety.

### Specific Objectives

1. **Evaluate DR.KNOWS**: Assess the ability of DR.KNOWS to select the most likely diagnoses and their interpretable paths.
2. **Design and Implement a Human Evaluation Framework**: Design and implement the first dedicated human evaluation framework for the diagnostic and reasoning outputs generated by LLMs.
3. **Explore the Role of Knowledge Graphs**: Investigate whether adding a knowledge graph as an extra module to an LLM improves disease-related diagnosis generation.
4. **Validate the Effectiveness of the Human Evaluation Framework**: Demonstrate that the proposed human evaluation framework is practical for revealing key aspects of LLMs' diagnostic performance and for supporting diagnostic safety.

### Main Contributions

- Proposed the DR.KNOWS graph model, which selects the knowledge paths in the UMLS knowledge graph that are most relevant to a specific case.
- Designed the first dedicated human evaluation framework for LLMs' diagnostic generation and reasoning.
- Demonstrated through empirical analysis the impact of knowledge graphs on diagnostic abstraction and correct reasoning.
- Validated that the proposed evaluation framework reveals the strengths and weaknesses of LLMs in diagnostic generation, which can guide future model improvements.

In summary, this paper explores how combining large language models with medical knowledge graphs can improve the accuracy and reliability of automatic diagnosis, and it designs a human evaluation framework for that purpose.
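
The path-selection idea summarized above (pick the knowledge-graph paths most relevant to a problem description, then pass them to an LLM as retrieved context) can be illustrated with a small sketch. This is not the authors' DR.KNOWS model: the toy graph `TOY_KG`, its relation names, the function names, and the token-overlap score are all assumptions made for illustration. The overlap score stands in for whatever learned relevance scoring the paper's graph model uses.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical toy graph: concept -> list of (relation, neighbor concept).
# These entries are illustrative only and are not taken from UMLS.
Edge = Tuple[str, str]
TOY_KG: Dict[str, List[Edge]] = {
    "fever": [("may_be_caused_by", "pneumonia"), ("may_be_caused_by", "influenza")],
    "cough": [("may_be_caused_by", "pneumonia"), ("may_be_caused_by", "bronchitis")],
    "pneumonia": [("may_treat_with", "antibiotics")],
}


def enumerate_paths(kg: Dict[str, List[Edge]], start: str, max_hops: int = 2) -> List[List[str]]:
    """Return every path of 1..max_hops edges that starts at `start`.

    A path is a flat list: [concept, relation, concept, relation, concept, ...].
    """
    paths: List[List[str]] = []
    frontier: List[List[str]] = [[start]]
    for _ in range(max_hops):
        next_frontier: List[List[str]] = []
        for path in frontier:
            for relation, neighbor in kg.get(path[-1], []):
                if neighbor in path[::2]:  # concepts sit at even indices; skip cycles
                    continue
                extended = path + [relation, neighbor]
                paths.append(extended)
                next_frontier.append(extended)
        frontier = next_frontier
    return paths


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens; underscores and punctuation act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score_path(path: List[str], note: str) -> float:
    """Assumed stand-in score: fraction of path tokens that also occur in the note."""
    path_tokens = tokens(" ".join(path))
    return len(path_tokens & tokens(note)) / len(path_tokens)


def select_paths(kg: Dict[str, List[Edge]], concepts: List[str], note: str, top_k: int = 2) -> List[List[str]]:
    """Rank all candidate paths from the given concepts and keep the top_k."""
    candidates = [p for c in concepts for p in enumerate_paths(kg, c)]
    ranked = sorted(candidates, key=lambda p: score_path(p, note), reverse=True)
    return ranked[:top_k]


def build_prompt(note: str, paths: List[List[str]]) -> str:
    """Place the selected paths ahead of the note, as retrieved context for an LLM."""
    rendered = "\n".join(" -> ".join(p) for p in paths)
    return (
        "Knowledge paths:\n" + rendered
        + "\n\nPatient note:\n" + note
        + "\n\nList the most likely diagnoses."
    )


if __name__ == "__main__":
    note = "Patient has fever and cough; chest imaging suggests pneumonia."
    selected = select_paths(TOY_KG, ["fever", "cough"], note)
    print(build_prompt(note, selected))
```

The prompt string would then be sent to whichever LLM is being evaluated. Keeping retrieval separate from generation means the selected paths can be inspected on their own, which fits the summary's emphasis on interpretable paths.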