Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates

Anson Li, Renee Shrestha, Thinoj Jegatheeswaran, Hannah O. Chan, Colin Hong, Rakesh Joshi
DOI: https://doi.org/10.1101/2024.09.27.24314506
2024-09-28
Abstract: The integration of Large Language Models (LLMs) is increasingly recognized for its potential to enhance various aspects of healthcare, including patient care, medical research, and education. ChatGPT, OpenAI's well-known and user-friendly GPT-4-based chatbot, has become increasingly popular. However, current limitations of LLMs, such as hallucinations, outdated information, and ethical and legal complications, may pose significant risks to patients and contribute to the spread of medical disinformation. This study focuses on the application of Retrieval-Augmented Generation (RAG) to mitigate common limitations of LLMs like ChatGPT and assesses its effectiveness in summarizing and organizing medical information. Up-to-date clinical guidelines were used as the source of information to create detailed medical templates. These were evaluated against human-generated templates by a panel of physicians, using Likert scales for accuracy and usefulness, and programmatically using BERTScores for textual similarity. The LLM templates scored higher on average for both accuracy and usefulness when compared to human-generated templates. BERTScore analysis further showed high textual similarity between ChatGPT- and human-generated templates. These results indicate that RAG-enhanced LLM prompting can effectively summarize and organize medical information, demonstrating high potential for use in clinical settings.
What problem does this paper attempt to address?
The paper attempts to address the issue of "hallucination" in large language models (LLMs) during medical consultations, where the models generate non-existent or incorrect content. Specifically, the study aims to mitigate this hallucination phenomenon by introducing Retrieval-Augmented Generation (RAG) technology and evaluating its effectiveness in summarizing and organizing medical information.

### Main Issues:

1. **Hallucination Phenomenon**: LLMs like ChatGPT may generate non-existent or incorrect information when creating medical consultation templates, which can pose risks to patients and lead to medical misinformation.
2. **Outdated Information**: The knowledge base of LLMs is not updated in a timely manner, which may result in generated information that does not align with the latest clinical guidelines.
3. **Ethical and Legal Issues**: Using LLMs for diagnosis and treatment planning involves issues of responsibility and accuracy, potentially leading to legal disputes and ethical concerns.

### Solution:

- **RAG Technology**: By referencing external supplementary information sources, RAG can provide LLMs with the latest, reliable, and specific information, reducing hallucination phenomena and improving the accuracy and reliability of generated content.
- **Template Generation**: The study uses RAG-enhanced ChatGPT to generate medical consultation templates and compares them with human-generated templates to evaluate their accuracy and practicality.

### Research Methods:

- **Data Source**: The latest clinical guidelines are used as the information source to generate electronic medical record (EMR) templates.
- **Template Generation**: Methods such as few-shot prompting, directional stimulus prompting, and self-correction prompting are used to guide ChatGPT in generating detailed medical consultation templates.
- **Evaluation Method**: A group of doctors evaluates the accuracy and practicality of the templates using the Likert scale, and text similarity analysis is conducted using BERTScore.

### Research Results:

- **Accuracy**: The templates generated by ChatGPT scored higher in accuracy, with an average score of 4.62, compared to 4.47 for human-generated templates.
- **Practicality**: The templates generated by ChatGPT also performed better in practicality, with an average score of 4.13, compared to 3.71 for human-generated templates.
- **Text Similarity**: BERTScore analysis shows that the templates generated by ChatGPT have very high text similarity to human-generated templates, with an average F1 score of 0.84.

### Conclusion:

The research results indicate that RAG-enhanced LLMs can effectively summarize and organize medical information, with the generated templates outperforming human-generated templates in both accuracy and practicality. This provides strong support for the application of LLMs in clinical settings, but also highlights areas for further improvement, such as reducing occasional inaccuracies and improving the consistency of generated documents.
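
To make the BERTScore F1 figure reported above more concrete, the following is a minimal sketch of the greedy-matching idea behind the metric: each token in one text is paired with its most similar token in the other text by cosine similarity, and the averages give precision, recall, and F1. This is not the authors' evaluation code. The 2-dimensional vectors and the helper names `cosine` and `greedy_match_f1` are illustrative assumptions; a real BERTScore run takes contextual token embeddings from a pretrained model and may also apply idf weighting and baseline rescaling, none of which is shown here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_f1(candidate, reference):
    """Greedy-matching precision, recall, and F1 over token embeddings.

    candidate, reference: non-empty lists of token embedding vectors.
    Precision averages, over candidate tokens, the best similarity to any
    reference token; recall does the same over reference tokens.
    """
    precision = sum(
        max(cosine(c, r) for r in reference) for c in candidate
    ) / len(candidate)
    recall = sum(
        max(cosine(r, c) for c in candidate) for r in reference
    ) / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical toy embeddings standing in for an LLM-generated template
# (candidate) and a human-generated template (reference).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

precision, recall, f1 = greedy_match_f1(candidate, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case every candidate token has an exact match in the reference, so precision is 1, while the reference token `[1.0, 1.0]` is only partially matched, which pulls recall and F1 below 1. A high F1 therefore indicates that the two texts cover similar content in both directions, not that either one is clinically correct, which is why the study pairs it with physician ratings.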