How well do LLMs cite relevant medical references? An evaluation framework and analyses

Kevin Wu, Eric Wu, Ally Cassasola, Angela Zhang, Kevin Wei, Teresa Nguyen, Sith Riantawan, Patricia Shi Riantawan, Daniel E. Ho, James Zou
2024-02-03
Abstract: Large language models (LLMs) are currently being used to answer medical questions across a variety of clinical domains. Recent top-performing commercial LLMs, in particular, are also capable of citing sources to support their responses. In this paper, we ask: do the sources that LLMs generate actually support the claims that they make? To answer this, we propose three contributions. First, as expert medical annotations are an expensive and time-consuming bottleneck for scalable evaluation, we demonstrate that GPT-4 is highly accurate in validating source relevance, agreeing 88% of the time with a panel of medical doctors. Second, we develop an end-to-end, automated pipeline called SourceCheckup and use it to evaluate five top-performing LLMs on a dataset of 1200 generated questions, totaling over 40K pairs of statements and sources. Interestingly, we find that roughly 50% to 90% of LLM responses are not fully supported by the sources they provide. We also evaluate GPT-4 with retrieval augmented generation (RAG) and find that, even still, around 30% of individual statements are unsupported, while nearly half of its responses are not fully supported. Third, we open-source our curated dataset of medical questions and expert annotations for future evaluations. Given the rapid pace of LLM development and the potential harms of incorrect or outdated medical information, it is crucial to also understand and quantify their capability to produce relevant, trustworthy medical references.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper evaluates how well large language models (LLMs) cite relevant medical references when answering medical questions. Specifically, it explores the following key issues:

1. **Citation Reliability**: Do the citations generated by LLMs actually support the claims they make?
2. **Automated Evaluation Framework**: How can the reliability of LLMs' citations be assessed efficiently and accurately?
3. **Performance of Different Models**: How do current top commercial LLMs perform in terms of citation reliability?

### Background and Motivation

As LLMs are increasingly applied in medicine, their ability to cite relevant literature when answering medical questions becomes particularly important. However, LLMs may "hallucinate", generating statements that no source supports. This is especially dangerous in medicine, where incorrect advice can cause serious harm to patients. Assessing the reliability and accuracy of LLMs' citations is therefore crucial to ensuring their safe use in clinical medicine.

### Main Contributions

1. **Automated Evaluation Framework**: The authors propose an end-to-end automated pipeline called SourceCheckup to evaluate the reliability of LLMs' citations. The framework has four modules: question generation, LLM question answering, statement and URL source parsing, and source verification.
2. **Performance Evaluation**: Using SourceCheckup, the authors evaluated five top commercial LLMs (GPT-4 (RAG and API), Claude v2.1, Mistral Medium, Gemini Pro). The results show that even the most advanced model evaluated, GPT-4 (RAG), has a response-level support rate of only 54%.
3. **Dataset Release**: The authors released a dataset containing 1200 medical questions and 284 clinician-annotated question/answer pairs for future research use.

### Methods

1. **Question Generation**: Text was extracted from webpages of MayoClinic, UpToDate, and Reddit r/AskDocs, and new medical questions were generated from it using GPT-4.
2. **LLM Question Answering**: The generated questions were submitted to the five LLMs to obtain each model's answers and cited URLs.
3. **Statement and URL Source Parsing**: GPT-4 was used to break each model's answer down into individual statements, and the cited URLs were downloaded.
4. **Source Verification**: GPT-4 was used as the source verification model to determine whether each statement was supported by the provided sources.

### Results

1. **Source URL Validity**: The proportion of valid URLs generated by each model was evaluated. For example, GPT-4 (RAG) had the highest proportion of valid URLs, but about 20% of its responses still did not include any citations.
2. **Statement-Level Support**: The proportion of each model's statements that were supported by at least one source was evaluated. For example, GPT-4 (RAG) had a statement-level support rate of around 70%.
3. **Response-Level Support**: The proportion of each model's responses in which every statement was supported was evaluated. For example, GPT-4 (RAG) had a response-level support rate of only 54%.

### Discussion

1. **Accuracy of Automated Evaluation**: The SourceCheckup framework performed well in verifying source relevance, agreeing 88% of the time with three practicing U.S. physicians.
2. **Model Performance Differences**: Citation reliability differed significantly among LLMs, especially on open-ended questions from sources like Reddit r/AskDocs.
3. **Challenges in Practical Application**: Although retrieval-augmented generation (RAG) has improved LLMs' citation capabilities, many issues remain unresolved and require further research.
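
The gap between the statement-level rate (around 70%) and the response-level rate (54%) reported for GPT-4 (RAG) follows from how the two metrics are defined: a response counts as fully supported only if every one of its statements is supported by at least one cited source, so a single unsupported statement fails the whole response. The following is a minimal sketch of that aggregation, not the authors' implementation. The `Statement` and `Response` classes and both function names are hypothetical, the per-source verdicts are assumed to come from a separate verification step, and treating a response with no parsed statements as not fully supported is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One parsed claim plus a verdict per cited source (hypothetical structure)."""
    text: str
    # True where the verification step judged that the source supports the claim.
    source_verdicts: list[bool] = field(default_factory=list)

    @property
    def supported(self) -> bool:
        # Supported if at least one cited source backs the statement.
        return any(self.source_verdicts)


@dataclass
class Response:
    """One model answer, broken down into statements."""
    statements: list[Statement]

    @property
    def fully_supported(self) -> bool:
        # Assumption: a response with no statements is not fully supported.
        return bool(self.statements) and all(s.supported for s in self.statements)


def statement_level_support(responses: list[Response]) -> float:
    """Fraction of all statements supported by at least one source."""
    statements = [s for r in responses for s in r.statements]
    if not statements:
        return 0.0
    return sum(s.supported for s in statements) / len(statements)


def response_level_support(responses: list[Response]) -> float:
    """Fraction of responses in which every statement is supported."""
    if not responses:
        return 0.0
    return sum(r.fully_supported for r in responses) / len(responses)


if __name__ == "__main__":
    # Toy data: 3 of 5 statements are supported, but only 1 of 3 responses
    # has every statement supported.
    toy = [
        Response([Statement("a", [True]), Statement("b", [False, True])]),
        Response([Statement("c", [True]), Statement("d", [False])]),
        Response([Statement("e", [])]),  # no sources cited
    ]
    print(f"statement-level support: {statement_level_support(toy):.2f}")
    print(f"response-level support:  {response_level_support(toy):.2f}")
```

Because the response-level metric requires all statements to pass, it can never exceed the share of responses that contain no unsupported statement, which is why it falls below the statement-level rate whenever unsupported statements are spread across many responses.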

### Conclusion

This paper systematically evaluates the performance of LLMs in citing relevant medical references through the proposed SourceCheckup framework, revealing the current models' deficiencies in citation reliability. These findings are significant for regulators, clinicians, and patients, helping to promote the safe and reliable application of LLMs in the medical field.