Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness

Yu He Ke, Liyuan Jin, Kabilan Elangovan, Hairil Rizal Abdullah, Nan Liu, Alex Tiong Heng Sia, Chai Rick Soh, Joshua Yi Min Tung, Jasmine Chiat Ling Ong, Chang-Fu Kuo, Shao-Chun Wu, Vesela P. Kovacheva, Daniel Shu Wei Ting
2024-10-11
Abstract: Large Language Models (LLMs) show potential for medical applications but often lack specialized clinical knowledge. Retrieval Augmented Generation (RAG) allows customization with domain-specific information, making it suitable for healthcare. This study evaluates the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions. We developed LLM-RAG models using 35 local and 23 international preoperative guidelines and tested them against human-generated responses. A total of 3,682 responses were evaluated. Clinical documents were processed using LlamaIndex, and 10 LLMs, including GPT3.5, GPT4, and Claude-3, were assessed. Fourteen clinical scenarios were analyzed, focusing on seven aspects of preoperative instructions. Established guidelines and expert judgment were used to determine correct responses, with human-generated answers serving as comparisons. The LLM-RAG models generated responses within 20 seconds, significantly faster than clinicians (10 minutes). The GPT4 LLM-RAG model achieved the highest accuracy (96.4% vs. 86.6%, p=0.016), with no hallucinations and producing correct instructions comparable to clinicians. Results were consistent across both local and international guidelines. This study demonstrates the potential of LLM-RAG models for preoperative healthcare tasks, highlighting their efficiency, scalability, and reliability.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address the lack of specific domain knowledge in large language models (LLMs) for medical applications, particularly for preoperative assessment and management tasks. Specifically, the research team developed and evaluated an LLM model based on Retrieval-Augmented Generation (RAG) technology to improve the accuracy, consistency, and safety of these models in determining patient surgical suitability and providing preoperative guidance.

### Main Research Objectives

1. **Primary Objective**:
   - Evaluate the accuracy of the LLM-RAG model in determining whether a patient is suitable for surgery.
2. **Secondary Objectives**:
   - Assess the ability of the LLM-RAG model to provide accurate, consistent, and safe preoperative guidance, including:
     - Whether the patient needs to be assessed by a nurse or doctor.
     - Fasting guidelines.
     - Applicability of preoperative carbohydrate loading.
     - Medication management instructions.
     - Instructions for the medical team.
     - Types of preoperative optimization required.

### Research Background

1. **Existing Research**:
   - Large language models have already demonstrated human-comparable performance in basic clinical tasks, but for complex tasks such as clinical assessment and management, their responses rely solely on pre-trained knowledge, lacking support from institutional practice guidelines.
   - The hallucination problem of LLMs (i.e., generating incorrect or misleading information) poses significant challenges to safety and ethics.
   - Inadequate preoperative assessment leads to cancellations on the day of surgery, resulting in substantial economic burdens for hospitals.
2. **Significance of the Research**:
   - By optimizing LLMs, particularly using RAG technology, the performance of models in specific tasks can be improved, reducing hallucination issues and providing more personalized and accurate preoperative guidance.
   - Traditional preoperative assessments are time-consuming and costly, whereas RAG-based LLMs can significantly improve efficiency and reduce the waste of medical resources.

### Methods

1. **Data Sources**:
   - Use 35 local and 23 international preoperative guidelines as the knowledge base.
   - Evaluate 14 de-identified clinical scenarios covering a variety of patient and surgical complexities.
2. **Model Selection**:
   - Evaluate 10 different LLMs, including GPT3.5, GPT4, LLAMA2 series, Gemini-1.5-Pro, and Claude-3-Opus.
3. **Evaluation Framework**:
   - Use the S.C.O.R.E. framework (Safety, Consensus, Objectivity, Repeatability, and Explainability) for qualitative evaluation of the responses generated by the LLMs.

### Results

1. **Performance Evaluation**:
   - The GPT4-International model performed best in assessing patient surgical suitability, with an accuracy of 96.4%, significantly higher than human-generated answers (86.6%).
   - The GPT4-Local model achieved an accuracy of 93.0% in determining whether a patient needs to be assessed by a nurse or doctor, significantly better than non-RAG models (86.0%).
   - In assessing the surgical suitability of ASA 3 patients, the GPT4 model also performed significantly better than other models.
2. **Secondary Results**:
   - The GPT4-International model outperformed humans in generating the required medical optimizations (71.0% vs 55.0%), but was slightly inferior to humans in medication management instructions (91.0% vs 98.0%).
   - Overall, there was no significant difference in accuracy between the GPT4-International model and human-generated answers (83.0% vs 81.0%).

### Conclusion

This study successfully implemented an RAG-based LLM model for preoperative medical tasks, emphasizing the importance of baseline knowledge, upgradability, and scalability in medical scenarios. These models performed excellently in assessing patient surgical suitability and providing preoperative guidance, showing potential application prospects.
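
To make the retrieval step described under Methods more concrete, here is a minimal sketch of how guideline text might be selected and placed in front of a clinical question before it is sent to an LLM. It is not the authors' pipeline: the study processed its documents with LlamaIndex, whereas this sketch swaps in a toy bag-of-words cosine similarity built on the Python standard library so it can run without dependencies. The chunk texts, the function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`), and the prompt layout are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder chunks. These are NOT taken from the 35 local or
# 23 international guidelines used in the study.
GUIDELINE_CHUNKS = [
    "placeholder chunk about fasting before surgery and clear fluids",
    "placeholder chunk about anticoagulant medication management before surgery",
    "placeholder chunk about carbohydrate loading eligibility",
]


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return up to k chunks with non-zero similarity to the query, best first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]


def build_prompt(scenario: str, question: str, chunks: list[str], k: int = 2) -> str:
    """Assemble a prompt that places retrieved guideline text before the question."""
    context = retrieve(f"{scenario} {question}", chunks, k)
    context_block = "\n".join(f"- {c}" for c in context) or "- (no matching guideline text)"
    return (
        "Guideline excerpts:\n"
        f"{context_block}\n\n"
        f"Scenario: {scenario}\n"
        f"Question: {question}\n"
    )


if __name__ == "__main__":
    # Hypothetical scenario text, used only to exercise the functions.
    print(
        build_prompt(
            "hypothetical scenario: patient on anticoagulant medication",
            "How should medication be managed?",
            GUIDELINE_CHUNKS,
            k=1,
        )
    )
```

A production system would typically replace the term-frequency vectors with learned embeddings and a vector index, which is the role a framework such as LlamaIndex plays. The overall shape (retrieve the most relevant guideline text, then condition the model on it) is what lets the same LLM be pointed at either local or international guidelines.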