Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework

Simone Kresevic, Mauro Giuffrè, Milos Ajcevic, Agostino Accardo, Lory S. Crocè, Dennis L. Shung
DOI: https://doi.org/10.1038/s41746-024-01091-y
IF: 15.2
2024-04-24
npj Digital Medicine
Abstract: Large language models (LLMs) can potentially transform healthcare, particularly in providing the right information to the right provider at the right time in the hospital workflow. This study investigates the integration of LLMs into healthcare, specifically focusing on improving clinical decision support systems (CDSSs) through accurate interpretation of medical guidelines for chronic Hepatitis C Virus infection management. Utilizing OpenAI's GPT-4 Turbo model, we developed a customized LLM framework that incorporates retrieval augmented generation (RAG) and prompt engineering. Our framework involved guideline conversion into the best-structured format that can be efficiently processed by LLMs to provide the most accurate output. An ablation study was conducted to evaluate the impact of different formatting and learning strategies on the LLM's answer generation accuracy. The baseline GPT-4 Turbo model's performance was compared against five experimental setups with increasing levels of complexity: inclusion of in-context guidelines, guideline reformatting, and implementation of few-shot learning. Our primary outcome was the qualitative assessment of accuracy based on expert review, while secondary outcomes included the quantitative measurement of similarity of LLM-generated responses to expert-provided answers using text-similarity scores. The results showed a significant improvement in accuracy from 43 to 99% (p < 0.001) when guidelines were provided as context in a coherent corpus of text and non-text sources were converted into text. In addition, few-shot learning did not seem to improve overall accuracy. The study highlights that structured guideline reformatting and advanced prompt engineering (data quality vs. data quantity) can enhance the efficacy of LLM integrations to CDSSs for guideline delivery.
health care sciences & services, medical informatics
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper aims to optimize how large language models (LLMs) interpret hepatology clinical guidelines, in order to improve the accuracy of clinical decision support systems (CDSSs) for managing chronic hepatitis C virus (HCV) infection. Specifically, the study uses OpenAI's GPT-4 Turbo model to develop a custom LLM framework that combines retrieval-augmented generation (RAG) and prompt engineering to interpret medical guidelines more accurately.

### Main Research Content

1. **Background and Motivation**:
   - LLMs have significant potential in healthcare, particularly in providing timely and accurate information within hospital workflows.
   - The study explores integrating LLMs into healthcare, specifically by improving CDSSs through the accurate interpretation of medical guidelines.
   - The research employs OpenAI's GPT-4 Turbo model to develop a custom LLM framework that includes RAG and prompt engineering.
2. **Methods**:
   - Converting the guidelines into the structured format that LLMs can process most efficiently and that yields the most accurate output.
   - Conducting an ablation study to evaluate the impact of different formatting and learning strategies on the accuracy of LLM-generated answers.
   - Comparing the baseline GPT-4 Turbo model with five experimental setups of increasing complexity: inclusion of in-context guidelines, guideline reformatting, and implementation of few-shot learning.
   - Using a qualitative assessment of accuracy by expert review as the primary outcome, and text-similarity scores between LLM-generated responses and expert-provided answers as secondary outcomes.
3. **Main Results**:
   - Accuracy increased significantly from 43% to 99% (p < 0.001) when guidelines were provided as context in a coherent corpus of text and non-text sources were converted into text.
   - Few-shot learning did not seem to improve overall accuracy.
   - Structured guideline reformatting and advanced prompt engineering (data quality vs. data quantity) can enhance the effectiveness of LLMs in CDSSs.
4. **Discussion**:
   - LLMs struggle to parse non-text sources such as tables, but performance improves significantly when tables are converted into text-based lists.
   - Structured guideline reformatting and advanced prompt engineering are crucial for improving LLM accuracy.
   - Few-shot learning did not significantly enhance overall accuracy.
   - Further research is needed to improve LLMs' ability to parse non-text sources and to validate new evaluation metrics that measure not only similarity but also the accuracy of clinical LLM applications.

### Conclusion

The study results suggest that LLMs like GPT-4 Turbo are suitable for parsing clinical guidelines, and that their effectiveness can be enhanced through structured formatting strategies, prompt engineering, and converting non-text sources into text. The research also indicates that proper reformatting may render few-shot learning unnecessary for increasing overall accuracy. Future research should further improve LLMs' ability to parse non-text sources and validate new evaluation metrics that assess both the accuracy and the similarity of clinical LLM applications.
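The summary reports that converting tables into text-based lists and supplying the guideline text as in-context material improved accuracy. The following is a minimal Python sketch of that idea and should not be read as the authors' implementation. The function names `table_to_text_list` and `build_prompt`, the prompt wording, and the example rows are all hypothetical; the rows are placeholders and do not come from any real guideline.

```python
from typing import Dict, List


def table_to_text_list(title: str, rows: List[Dict[str, str]]) -> str:
    """Flatten a table (a list of row dicts) into a text-based bullet list.

    Hypothetical helper: each row becomes one line of 'column: value' pairs,
    so the model never has to parse a table layout.
    """
    lines = [f"{title}:"]
    for row in rows:
        parts = [f"{column}: {value}" for column, value in row.items()]
        lines.append("- " + "; ".join(parts))
    return "\n".join(lines)


def build_prompt(guideline_text: str, question: str) -> str:
    """Place the reformatted guideline text ahead of the question as context.

    The instruction wording is an assumption, not the prompt used in the paper.
    """
    return (
        "Answer using only the guideline excerpt below.\n\n"
        f"Guideline excerpt:\n{guideline_text}\n\n"
        f"Question: {question}"
    )


# Placeholder rows for illustration only; NOT real clinical recommendations.
example_rows = [
    {"Patient group": "Group A", "Recommendation": "Option 1", "Duration": "X weeks"},
    {"Patient group": "Group B", "Recommendation": "Option 2", "Duration": "Y weeks"},
]

context = table_to_text_list("Illustrative treatment table", example_rows)
prompt = build_prompt(context, "What is recommended for Group B?")
print(prompt)
```

The resulting string could be sent to a chat model as the user message. In a retrieval-augmented setup, `guideline_text` would hold only the passages retrieved for the question rather than the whole guideline.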