Abstract: The integration of Large Language Models (LLMs) is increasingly recognized for its potential to enhance various aspects of healthcare, including patient care, medical research, and education. ChatGPT, OpenAI's user-friendly GPT-4-based chatbot, has become increasingly popular. However, current limitations of LLMs, such as hallucinations, outdated information, and ethical and legal complications, may pose significant risks to patients and contribute to the spread of medical disinformation. This study focuses on the application of Retrieval-Augmented Generation (RAG) to mitigate common limitations of LLMs like ChatGPT and assesses its effectiveness in summarizing and organizing medical information. Up-to-date clinical guidelines were used as the source of information to create detailed medical templates. These were evaluated against human-generated templates by a panel of physicians, using Likert scales for accuracy and usefulness, and programmatically using BERTScore for textual similarity. The LLM templates scored higher on average for both accuracy and usefulness than the human-generated templates. BERTScore analysis further showed high textual similarity between ChatGPT- and human-generated templates. These results indicate that RAG-enhanced LLM prompting can effectively summarize and organize medical information, demonstrating high potential for use in clinical settings.

What problem does this paper attempt to address?

The paper attempts to address the issue of "hallucination" in large language models (LLMs) during medical consultations, where the models generate non-existent or incorrect content. Specifically, the study aims to mitigate this hallucination phenomenon by introducing Retrieval-Augmented Generation (RAG) technology and evaluating its effectiveness in summarizing and organizing medical information.

### Main Issues:

1. **Hallucination Phenomenon**: LLMs like ChatGPT may generate non-existent or incorrect information when creating medical consultation templates, which can pose risks to patients and lead to medical misinformation.
2. **Outdated Information**: The knowledge base of LLMs is not updated in a timely manner, which may result in generated information that does not align with the latest clinical guidelines.
3. **Ethical and Legal Issues**: Using LLMs for diagnosis and treatment planning involves issues of responsibility and accuracy, potentially leading to legal disputes and ethical concerns.

### Solution:

- **RAG Technology**: By referencing external supplementary information sources, RAG can provide LLMs with the latest, reliable, and specific information, reducing hallucination phenomena and improving the accuracy and reliability of generated content.
- **Template Generation**: The study uses RAG-enhanced ChatGPT to generate medical consultation templates and compares them with human-generated templates to evaluate their accuracy and practicality.

### Research Methods:

- **Data Source**: The latest clinical guidelines are used as the information source to generate electronic medical record (EMR) templates.
- **Template Generation**: Methods such as few-shot prompting, directional stimulus prompting, and self-correction prompting are used to guide ChatGPT in generating detailed medical consultation templates.
- **Evaluation Method**: A group of doctors evaluates the accuracy and practicality of the templates using the Likert scale, and text similarity analysis is conducted using BERTScore.

### Research Results:

- **Accuracy**: The templates generated by ChatGPT scored higher in accuracy, with an average score of 4.62, compared to 4.47 for human-generated templates.
- **Practicality**: The templates generated by ChatGPT also performed better in practicality, with an average score of 4.13, compared to 3.71 for human-generated templates.
- **Text Similarity**: BERTScore analysis shows that the templates generated by ChatGPT have very high text similarity to human-generated templates, with an average F1 score of 0.84.

### Conclusion:

The research results indicate that RAG-enhanced LLMs can effectively summarize and organize medical information, with the generated templates outperforming human-generated templates in both accuracy and practicality. This provides strong support for the application of LLMs in clinical settings, but also highlights areas for further improvement, such as reducing occasional inaccuracies and improving the consistency of generated documents.
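The retrieval step behind RAG, as described in the summary above, can be illustrated with a small sketch. The summary does not say how the study retrieved guideline text, so everything here is an assumption made for illustration: the guideline passages are invented placeholders (not clinical guidance), the bag-of-words cosine similarity stands in for whatever retriever was actually used, and the names `GUIDELINE_PASSAGES`, `retrieve`, and `build_prompt` are hypothetical. No model is called; the sketch only shows how retrieved text could be placed in front of the task so that the model answers from supplied sources rather than from memory.

```python
import math
import re
from collections import Counter

# Hypothetical guideline excerpts; placeholders only, not real clinical guidance.
GUIDELINE_PASSAGES = [
    "Hypertension follow-up: record blood pressure readings and current medication.",
    "Diabetes review: record HbA1c result, foot examination and medication adherence.",
    "Asthma review: record inhaler technique, peak flow and symptom control.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, passages, k=1):
    """Return the k passages most similar to the query (assumed toy retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, passages, k=1):
    """Assemble a prompt that places the retrieved excerpts before the task."""
    context = "\n".join(f"- {p}" for p in retrieve(query, passages, k))
    return (
        "Use only the guideline excerpts below to draft the template.\n"
        f"Guideline excerpts:\n{context}\n"
        f"Task: {query}\n"
    )


if __name__ == "__main__":
    print(build_prompt("Create a template for a diabetes review visit",
                       GUIDELINE_PASSAGES))
```

In a real pipeline the toy similarity would typically be replaced by an embedding-based retriever over the full guideline documents, and the assembled prompt would be combined with the prompting methods listed under Research Methods.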

Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates
