Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology

Aidan Gilson, Xuguang Ai, Thilaka Arunachalam, Ziyou Chen, Ki Xiong Cheong, Amisha Dave, Cameron Duic, Mercy Kibe, Annette Kaminaka, Minali Prasad, Fares Siddig, Maxwell Singer, Wendy Wong, Qiao Jin, Tiarnan D.L. Keenan, Xia Hu, Emily Y. Chew, Zhiyong Lu, Hua Xu, Ron A. Adelman, Yih-Chung Tham, Qingyu Chen

2024-09-21

Abstract: Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. While Retrieval Augment Generation (RAG) is a popular way to address this issue, few studies have implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with 70,000 ophthalmology-specific documents that retrieves relevant documents to augment LLMs at inference time. In a case study on long-form consumer health questions, we systematically evaluated the responses of LLMs with and without RAG, including over 500 references, on 100 questions with 10 healthcare professionals. The evaluation focuses on factuality of evidence, selection and ranking of evidence, attribution of evidence, and answer accuracy and completeness. LLMs without RAG provided 252 references in total. Of these, 45.3% were hallucinated, 34.1% contained minor errors, and 20.6% were correct. In contrast, LLMs with RAG significantly improved accuracy (54.5% being correct) and reduced error rates (18.8% with minor hallucinations and 26.7% with errors). 62.5% of the top 10 documents retrieved by RAG were selected as the top references in the LLM response, with an average ranking of 4.9. The use of RAG also improved evidence attribution (increasing from 1.85 to 2.49 on a 5-point scale, P<0.001), albeit with slight decreases in accuracy (from 3.52 to 3.23, P=0.03) and completeness (from 3.47 to 3.27, P=0.17). The results demonstrate that LLMs frequently produced hallucinated and erroneous evidence in their responses, raising concerns for downstream applications in the medical domain. RAG substantially reduced the proportion of such evidence but encountered challenges.
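The percentages reported for the 252 references produced without RAG can be converted into approximate counts. The following is a minimal sketch of that arithmetic; the counts are rounded reconstructions from the stated percentages, not figures reported in the abstract, and the variable names are illustrative.

```python
# Figures taken from the abstract: 252 references from LLMs without RAG.
TOTAL_REFERENCES = 252
SHARES_PERCENT = {
    "hallucinated": 45.3,
    "minor errors": 34.1,
    "correct": 20.6,
}

# Approximate counts implied by the percentages (rounded; an assumption,
# since the abstract reports percentages only).
counts = {
    label: round(TOTAL_REFERENCES * pct / 100)
    for label, pct in SHARES_PERCENT.items()
}

for label, n in counts.items():
    print(f"{label}: about {n} of {TOTAL_REFERENCES}")

# Sanity checks: the shares cover all references.
print("share total:", round(sum(SHARES_PERCENT.values()), 1))
print("count total:", sum(counts.values()))
```

Under these assumptions, roughly 114 references were hallucinated, 86 contained minor errors, and 52 were correct, which together account for all 252.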

Computation and Language, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses the tendency of large language models (LLMs) to generate responses in medical applications that lack supporting evidence or rest on fabricated evidence. Although LLMs perform well on natural language processing tasks, their output in the medical field can be inaccurate or fictitious. To tackle this problem, the paper develops a retrieval-augmented generation (RAG) pipeline specific to ophthalmology and evaluates it in a case study on long-form consumer health question answering.

The main contributions of the paper are:

1. **Building a domain-specific corpus**: The paper collects approximately 70,000 ophthalmology-related documents, including literature, clinical guidelines, and relevant wiki articles, to construct the RAG pipeline.
2. **Systematic evaluation**: Using 100 consumer health questions, the paper systematically evaluates the answers generated by LLMs with and without RAG, focusing on the factuality, selection, ranking, and attribution of evidence, as well as the accuracy and completeness of the answers.
3. **Open data and code**: The relevant data, models, and code are made publicly available to facilitate community reproduction and further development.

The paper finds that RAG significantly improves the factuality of evidence and reduces the error rate. However, LLMs do not always use the most relevant documents supplied by RAG, so some fabricated evidence remains. Irrelevant documents introduced by RAG may also reduce the accuracy and completeness of the answers. These results indicate that RAG is more effective than non-RAG methods at grounding long-form medical answers in evidence, but that the quality of evidence retrieval, selection, and attribution needs further improvement.
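The retrieve-then-augment idea described above can be illustrated with a small sketch. This is not the paper's pipeline: the three-document corpus, the bag-of-words cosine retriever, and the function names `retrieve` and `build_prompt` are all hypothetical stand-ins, and the final call to an LLM is deliberately left out.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(question, corpus, k=2):
    """Return up to k documents ranked by similarity to the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(d["text"]))), d) for d in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:k] if score > 0]

def build_prompt(question, docs):
    """Prepend numbered sources to the question so the model can cite them."""
    sources = "\n".join(
        f"[{i}] {d['title']}: {d['text']}" for i, d in enumerate(docs, 1)
    )
    return (
        "Answer the question using only the numbered sources and cite them.\n"
        f"Sources:\n{sources}\n"
        f"Question: {question}\n"
    )

# Toy corpus (hypothetical; the paper uses about 70,000 documents).
CORPUS = [
    {"title": "Glaucoma overview",
     "text": "Glaucoma damages the optic nerve and is often linked to raised eye pressure."},
    {"title": "Cataract surgery",
     "text": "Cataract surgery replaces the cloudy lens with an artificial lens."},
    {"title": "Dry eye care",
     "text": "Artificial tears can relieve dry eye symptoms."},
]

question = "What is cataract surgery?"
docs = retrieve(question, CORPUS, k=2)
prompt = build_prompt(question, docs)
print(prompt)  # this augmented prompt would then be sent to an LLM
```

A real system would typically replace the bag-of-words scorer with a dense or hybrid retriever. The sketch also makes the failure modes noted above concrete: a weakly matching document can still enter the prompt, and nothing forces the model to cite the best-ranked source.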
