Assessing and Enhancing Large Language Models in Rare Disease Question-answering

Guanchu Wang, Junhao Ran, Ruixiang Tang, Chia-Yuan Chang, Yu-Neng Chuang, Zirui Liu, Vladimir Braverman, Zhandong Liu, Xia Hu
2024-08-16
Abstract: Despite the impressive capabilities of Large Language Models (LLMs) in general medical domains, questions remain about their performance in diagnosing rare diseases. To answer this question, we aim to assess the diagnostic performance of LLMs in rare diseases, and explore methods to enhance their effectiveness in this area. In this work, we introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of LLMs in diagnosing rare diseases. Specifically, we collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases. Additionally, we annotated meta-data for each question, facilitating the extraction of subsets specific to any given disease and its property. Based on the ReDis-QA dataset, we benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models. To facilitate retrieval-augmented generation for rare disease diagnosis, we collect the first rare diseases corpus (ReCOP), sourced from the National Organization for Rare Disorders (NORD) database. Specifically, we split the report of each rare disease into multiple chunks, each representing a different property of the disease, including their overview, symptoms, causes, effects, related disorders, diagnosis, and standard therapies. This structure ensures that the information within each chunk aligns consistently with a question. Experimental results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%. Moreover, it significantly guides LLMs to generate trustworthy answers and explanations that can be traced back to existing literature.
Computational Engineering, Finance, and Science; Artificial Intelligence
What problem does this paper attempt to address?
This paper sets out to evaluate and improve the performance of large language models (LLMs) in diagnosing rare diseases. LLMs perform well in general medical domains, but their ability to diagnose rare diseases remains in question. Specifically, the paper aims to:

1. **Evaluate the performance of LLMs in the diagnosis of rare diseases**:
   - The authors constructed ReDis-QA, a high-quality rare disease question-answering dataset containing 1,360 question-answer pairs that cover 205 rare diseases.
   - Several open-source LLMs were benchmarked on ReDis-QA, revealing how difficult rare disease diagnosis remains for these models.
2. **Explore methods to improve the effectiveness of LLMs in the diagnosis of rare diseases**:
   - The authors collected ReCOP, the first rare disease corpus. It is sourced from the National Organization for Rare Disorders (NORD) database and covers multiple properties of each disease: overview, symptoms, causes, effects, related disorders, diagnosis, and standard therapies.
   - Retrieval-augmented generation (RAG) supplies the external knowledge in ReCOP to the LLM to improve its diagnostic accuracy.

### Main findings

- **Benchmarking results**: Current open-source LLMs still face significant challenges in diagnosing rare diseases, and their accuracy is generally low.
- **The effect of ReCOP**: With retrieval-augmented generation over ReCOP, the accuracy of LLMs on ReDis-QA increases by an average of 8%. The models also produce more trustworthy answers and explanations that can be traced back to existing literature.

### Key points of the solution

1. **Constructing the ReDis-QA dataset**: To evaluate LLMs on rare disease diagnosis, the authors collected 1,360 high-quality question-answer pairs and annotated each question with metadata, so that subsets specific to a given disease and property can be extracted.
2. **Developing the ReCOP corpus**: To address the weaknesses of LLMs in rare disease diagnosis, the authors collected rare disease reports from the NORD database and split each report into multiple chunks, each corresponding to a different property of the disease.
3. **Applying retrieval-augmented generation (RAG)**: Combining ReCOP with existing retrieval algorithms (such as BM25 and MedCPT) improves the accuracy of LLMs on rare disease questions.

In conclusion, by constructing a dedicated dataset and corpus and applying retrieval-augmented generation, the paper both evaluates and improves the ability of LLMs to diagnose rare diseases.
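The retrieval step summarized above (property-level chunks, a lexical retriever such as BM25, and retrieved text prepended to the question) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Chunk`, `BM25`, and `build_prompt` names are hypothetical, the toy corpus strings are hand-written and not taken from ReCOP or NORD, the BM25 variant and its `k1` and `b` values are common defaults rather than the paper's settings, and the prompt template is invented.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Chunk:
    """One property-level chunk of a disease report (hypothetical schema)."""
    disease: str
    prop: str  # e.g. "causes", "symptoms", "diagnosis"
    text: str


# Hand-written toy chunks for illustration only; NOT taken from ReCOP or NORD.
CORPUS = [
    Chunk("Cystic fibrosis", "causes",
          "Cystic fibrosis is caused by mutations in the CFTR gene."),
    Chunk("Cystic fibrosis", "symptoms",
          "Cystic fibrosis symptoms include persistent cough and repeated lung infections."),
    Chunk("Cystic fibrosis", "diagnosis",
          "Cystic fibrosis is commonly diagnosed with a sweat chloride test."),
    Chunk("Huntington disease", "causes",
          "Huntington disease is caused by an expanded CAG repeat in the HTT gene."),
    Chunk("Huntington disease", "symptoms",
          "Huntington disease symptoms include involuntary movements known as chorea."),
]


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Plain Okapi BM25 scorer over a list of strings (assumed defaults)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.tfs = [Counter(t) for t in self.tokens]
        self.avgdl = sum(len(t) for t in self.tokens) / len(self.tokens)
        n = len(docs)
        df = Counter(term for t in self.tokens for term in set(t))
        # Smoothed IDF that stays positive for every term.
        self.idf = {term: math.log((n - f + 0.5) / (f + 0.5) + 1.0)
                    for term, f in df.items()}

    def score(self, query, i):
        tf, dl = self.tfs[i], len(self.tokens[i])
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=2):
        """Return indices of the k highest-scoring documents."""
        order = sorted(range(len(self.tokens)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def build_prompt(question, chunks):
    """Prepend retrieved chunks to the question (invented template)."""
    context = "\n".join(f"[{c.disease} / {c.prop}] {c.text}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    retriever = BM25([c.text for c in CORPUS])
    question = "Which gene is mutated in cystic fibrosis?"
    hits = [CORPUS[i] for i in retriever.top_k(question, k=2)]
    print(build_prompt(question, hits))
```

Because every chunk carries a `disease` and `prop` label, the same structure would also support filtering by disease or property before retrieval, in the spirit of the metadata-based subsets described for ReDis-QA. A dense retriever such as MedCPT would replace the `BM25` class while leaving `build_prompt` unchanged.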