Abstract: Large Language Models (LLMs) show potential for medical applications but often lack specialized clinical knowledge. Retrieval Augmented Generation (RAG) allows customization with domain-specific information, making it suitable for healthcare. This study evaluates the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions. We developed LLM-RAG models using 35 local and 23 international preoperative guidelines and tested them against human-generated responses. A total of 3,682 responses were evaluated. Clinical documents were processed using LlamaIndex, and 10 LLMs, including GPT3.5, GPT4, and Claude-3, were assessed. Fourteen clinical scenarios were analyzed, focusing on seven aspects of preoperative instructions. Established guidelines and expert judgment were used to determine correct responses, with human-generated answers serving as comparisons. The LLM-RAG models generated responses within 20 seconds, significantly faster than clinicians (10 minutes). The GPT4 LLM-RAG model achieved the highest accuracy (96.4% vs. 86.6%, p=0.016), with no hallucinations and producing correct instructions comparable to clinicians. Results were consistent across both local and international guidelines. This study demonstrates the potential of LLM-RAG models for preoperative healthcare tasks, highlighting their efficiency, scalability, and reliability.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address the lack of specific domain knowledge in large language models (LLMs) for medical applications, particularly for preoperative assessment and management tasks. Specifically, the research team developed and evaluated an LLM model based on Retrieval-Augmented Generation (RAG) technology to improve the accuracy, consistency, and safety of these models in determining patient surgical suitability and providing preoperative guidance.

### Main Research Objectives

1. **Primary Objective**:
   - Evaluate the accuracy of the LLM-RAG model in determining whether a patient is suitable for surgery.
2. **Secondary Objectives**:
   - Assess the ability of the LLM-RAG model to provide accurate, consistent, and safe preoperative guidance, including:
     - Whether the patient needs to be assessed by a nurse or doctor.
     - Fasting guidelines.
     - Applicability of preoperative carbohydrate loading.
     - Medication management instructions.
     - Instructions for the medical team.
     - Types of preoperative optimization required.

### Research Background

1. **Existing Research**:
   - Large language models have already demonstrated human-comparable performance in basic clinical tasks, but for complex tasks such as clinical assessment and management, their responses rely solely on pre-trained knowledge, lacking support from institutional practice guidelines.
   - The hallucination problem of LLMs (i.e., generating incorrect or misleading information) poses significant challenges to safety and ethics.
   - Inadequate preoperative assessment leads to cancellations on the day of surgery, resulting in substantial economic burdens for hospitals.
2. **Significance of the Research**:
   - By optimizing LLMs, particularly using RAG technology, the performance of models in specific tasks can be improved, reducing hallucination issues and providing more personalized and accurate preoperative guidance.
   - Traditional preoperative assessments are time-consuming and costly, whereas RAG-based LLMs can significantly improve efficiency and reduce the waste of medical resources.

### Methods

1. **Data Sources**:
   - Use 35 local and 23 international preoperative guidelines as the knowledge base.
   - Evaluate 14 de-identified clinical scenarios covering a variety of patient and surgical complexities.
2. **Model Selection**:
   - Evaluate 10 different LLMs, including GPT3.5, GPT4, LLAMA2 series, Gemini-1.5-Pro, and Claude-3-Opus.
3. **Evaluation Framework**:
   - Use the S.C.O.R.E. framework (Safety, Consensus, Objectivity, Repeatability, and Explainability) for qualitative evaluation of the responses generated by the LLMs.

### Results

1. **Performance Evaluation**:
   - The GPT4-International model performed best in assessing patient surgical suitability, with an accuracy of 96.4%, significantly higher than human-generated answers (86.6%).
   - The GPT4-Local model achieved an accuracy of 93.0% in determining whether a patient needs to be assessed by a nurse or doctor, significantly better than non-RAG models (86.0%).
   - In assessing the surgical suitability of ASA 3 patients, the GPT4 model also performed significantly better than other models.
2. **Secondary Results**:
   - The GPT4-International model outperformed humans in generating the required medical optimizations (71.0% vs 55.0%), but was slightly inferior to humans in medication management instructions (91.0% vs 98.0%).
   - Overall, there was no significant difference in accuracy between the GPT4-International model and human-generated answers (83.0% vs 81.0%).

### Conclusion

This study successfully implemented an RAG-based LLM model for preoperative medical tasks, emphasizing the importance of baseline knowledge, upgradability, and scalability in medical scenarios. These models performed excellently in assessing patient surgical suitability and providing preoperative guidance, showing potential application prospects.
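As a rough illustration of the retrieval-augmented setup summarized under Methods, the sketch below ranks guideline chunks against a question and places the best matches into a prompt. It is a toy example under stated assumptions, not the study's pipeline: the authors processed documents with LlamaIndex, whereas this uses a simple bag-of-words cosine similarity, and the names `GUIDELINE_CHUNKS`, `retrieve`, and `build_prompt` as well as the placeholder snippets are hypothetical and carry no clinical content.

```python
import math
import re
from collections import Counter

# Hypothetical placeholder snippets; NOT taken from the study's guidelines.
GUIDELINE_CHUNKS = [
    "Fasting: placeholder text describing fasting instructions before surgery.",
    "Medication: placeholder text describing which medication to continue or withhold.",
    "Carbohydrate loading: placeholder text describing preoperative carbohydrate drinks.",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query (toy retriever)."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine(query_vec, Counter(tokenize(chunk))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(scenario, question, chunks, k=2):
    """Assemble a prompt that grounds the question in retrieved guideline text."""
    context = "\n".join(f"- {chunk}" for chunk in retrieve(question, chunks, k))
    return (
        "Answer using only the guideline excerpts below.\n"
        f"Guideline excerpts:\n{context}\n\n"
        f"Clinical scenario: {scenario}\n"
        f"Question: {question}\n"
    )


if __name__ == "__main__":
    # The scenario string is a made-up placeholder, not one of the 14 study scenarios.
    print(build_prompt(
        scenario="Adult patient scheduled for elective surgery.",
        question="What are the fasting instructions?",
        chunks=GUIDELINE_CHUNKS,
        k=1,
    ))
```

The resulting prompt string would then be sent to whichever LLM is under evaluation. A production system would typically replace the word-count similarity with embedding-based retrieval over the full guideline corpus.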

Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness

Development and Testing of Retrieval Augmented Generation in Large Language Models -- A Case Study Report

Retrieval-augmented large language models for clinical trial screening.

Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model

Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates

Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology

Transforming Healthcare Education: Harnessing Large Language Models for Frontline Health Worker Capacity Building using Retrieval-Augmented Generation

Natural Language Programming in Medicine: Administering Evidence Based Clinical Workflows with Autonomous Agents Powered by Generative Large Language Models

A study of generative large language model for medical research and healthcare

Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes

Assessing Retrieval-Augmented Large Language Model Performance in Emergency Department ICD-10-CM Coding Compared to Human Coders

Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models

Evaluating multiple large language models in pediatric ophthalmology

BiomedRAG: A Retrieval Augmented Large Language Model for Biomedicine

Bailicai: A Domain-Optimized Retrieval-Augmented Generation Framework for Medical Applications

Determinants of exercise-induced changes in mitral regurgitation in patients with coronary artery disease and left ventricular dysfunction.

Retrieval-Augmented and Knowledge-Grounded Language Models for Faithful Clinical Medicine

Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework

Tool Calling: Enhancing Medication Consultation via Retrieval-Augmented Large Language Models