Enhancing Large Language Models with Domain-specific Retrieval Augment Generation: A Case Study on Long-form Consumer Health Question Answering in Ophthalmology

Aidan Gilson, Xuguang Ai, Thilaka Arunachalam, Ziyou Chen, Ki Xiong Cheong, Amisha Dave, Cameron Duic, Mercy Kibe, Annette Kaminaka, Minali Prasad, Fares Siddig, Maxwell Singer, Wendy Wong, Qiao Jin, Tiarnan D.L. Keenan, Xia Hu, Emily Y. Chew, Zhiyong Lu, Hua Xu, Ron A. Adelman, Yih-Chung Tham, Qingyu Chen
2024-09-21
Abstract: Despite the potential of Large Language Models (LLMs) in medicine, they may generate responses lacking supporting evidence or based on hallucinated evidence. While Retrieval Augment Generation (RAG) is a popular way to address this issue, few studies have implemented and evaluated RAG in downstream domain-specific applications. We developed a RAG pipeline with 70,000 ophthalmology-specific documents that retrieves relevant documents to augment LLMs at inference time. In a case study on long-form consumer health questions, we systematically evaluated the responses of LLMs with and without RAG, including over 500 references, on 100 questions with 10 healthcare professionals. The evaluation focuses on factuality of evidence, selection and ranking of evidence, attribution of evidence, and answer accuracy and completeness. LLMs without RAG provided 252 references in total. Of these, 45.3% were hallucinated, 34.1% contained minor errors, and 20.6% were correct. In contrast, LLMs with RAG significantly improved accuracy (54.5% being correct) and reduced error rates (18.8% with minor hallucinations and 26.7% with errors). 62.5% of the top 10 documents retrieved by RAG were selected as the top references in the LLM response, with an average ranking of 4.9. The use of RAG also improved evidence attribution (increasing from 1.85 to 2.49 on a 5-point scale, P<0.001), albeit with slight decreases in accuracy (from 3.52 to 3.23, P=0.03) and completeness (from 3.47 to 3.27, P=0.17). The results demonstrate that LLMs frequently exhibited hallucinated and erroneous evidence in their responses, raising concerns for downstream applications in the medical domain. RAG substantially reduced the proportion of such evidence but encountered challenges.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the tendency of large language models (LLMs) to generate medical responses that lack supporting evidence or rest on fabricated evidence. Although LLMs perform well on many natural language processing tasks, their answers in medical settings can be inaccurate or cite references that do not exist. To tackle this problem, the paper develops a retrieval-augmented generation (RAG) pipeline specific to the ophthalmology domain and evaluates it in a case study on long-form consumer health question answering.

The main contributions of the paper are:

1. **Building a domain-specific corpus**: The paper collects approximately 70,000 ophthalmology-specific documents, including literature, clinical guidelines, and relevant wiki articles, to construct the RAG pipeline.
2. **Systematic evaluation**: On 100 consumer health questions, 10 healthcare professionals evaluate the answers generated by LLMs with and without RAG, focusing on the factuality, selection and ranking, and attribution of evidence, as well as the accuracy and completeness of the answers.
3. **Open data and code**: The relevant data, models, and code are made publicly available to facilitate community reproduction and further development.

The paper finds that RAG raises the share of correct references from 20.6% to 54.5% and reduces hallucinated and erroneous ones. However, LLMs do not always use the most relevant documents that RAG provides (62.5% of the top 10 retrieved documents were selected as top references), so some fabricated evidence remains. Irrelevant retrieved documents may also lower answer quality: accuracy fell from 3.52 to 3.23 (P=0.03) and completeness from 3.47 to 3.27 (P=0.17, not statistically significant) on a 5-point scale. Overall, RAG is more effective than LLMs alone at grounding long-form medical answers in real evidence, but evidence retrieval, selection, and attribution all need further improvement.
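The retrieve-then-augment idea described above can be illustrated with a minimal sketch. This is not the paper's pipeline: the retriever here is a toy bag-of-words cosine similarity, the three-document corpus is placeholder text standing in for the roughly 70,000 ophthalmology documents, and the names `CORPUS`, `retrieve`, and `build_prompt` are hypothetical. A real system would likely use a stronger retriever and pass the resulting prompt to an LLM.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in corpus; the texts are placeholders, not real references.
CORPUS = [
    {"id": "doc-1", "text": "Placeholder passage about glaucoma and eye pressure."},
    {"id": "doc-2", "text": "Placeholder passage about cataract surgery recovery."},
    {"id": "doc-3", "text": "Placeholder passage about dry eye and artificial tears."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, corpus, k=2):
    """Return up to k documents ranked by similarity to the question.

    Documents with zero similarity are dropped so that clearly
    irrelevant text is never added to the prompt.
    """
    query = Counter(tokenize(question))
    scored = [(cosine(query, Counter(tokenize(doc["text"]))), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]


def build_prompt(question, docs):
    """Prepend numbered evidence to the question so the model can cite it."""
    evidence = "\n".join(
        f"[{rank}] ({doc['id']}) {doc['text']}" for rank, doc in enumerate(docs, start=1)
    )
    return (
        "Answer the question using only the evidence below, "
        "and cite evidence by its bracketed number.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}"
    )


if __name__ == "__main__":
    question = "What is glaucoma and how does eye pressure relate to it?"
    print(build_prompt(question, retrieve(question, CORPUS)))
```

Numbering the evidence in the prompt is one simple way to make attribution checkable afterwards: each citation in the answer can be traced back to a retrieved document, which is the kind of evidence selection and attribution the evaluation examines.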