Abstract: Despite the impressive capabilities of Large Language Models (LLMs) in general medical domains, questions remain about their performance in diagnosing rare diseases. To answer this question, we aim to assess the diagnostic performance of LLMs in rare diseases, and explore methods to enhance their effectiveness in this area. In this work, we introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of LLMs in diagnosing rare diseases. Specifically, we collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases. Additionally, we annotated metadata for each question, facilitating the extraction of subsets specific to any given disease and its property. Based on the ReDis-QA dataset, we benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models. To facilitate retrieval-augmented generation for rare disease diagnosis, we collect the first rare diseases corpus (ReCOP), sourced from the National Organization for Rare Disorders (NORD) database. Specifically, we split the report of each rare disease into multiple chunks, each representing a different property of the disease, including their overview, symptoms, causes, effects, related disorders, diagnosis, and standard therapies. This structure ensures that the information within each chunk aligns consistently with a question. Experimental results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%. Moreover, it significantly guides LLMs to generate trustworthy answers and explanations that can be traced back to existing literature.
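
The abstract mentions per-question metadata that allows subsets to be extracted by disease and by property. A minimal sketch of how such a subset could be selected and scored is shown below. The `QAItem` fields, the multiple-choice layout, and the placeholder records are all assumptions made for illustration; the abstract does not describe the dataset schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class QAItem:
    """One question-answer pair with metadata (hypothetical schema)."""
    question: str
    options: list[str]   # assumed multiple-choice format, not stated in the abstract
    answer_index: int    # index of the correct option
    disease: str         # assumed metadata field: which rare disease
    prop: str            # assumed metadata field: e.g. "symptoms", "diagnosis"


def select_subset(items, disease=None, prop=None):
    """Keep items matching the given disease and/or property (None matches all)."""
    return [
        it for it in items
        if (disease is None or it.disease == disease)
        and (prop is None or it.prop == prop)
    ]


def accuracy(items, predictions):
    """Fraction of items whose predicted option index equals the answer index."""
    if not items:
        raise ValueError("cannot score an empty subset")
    if len(items) != len(predictions):
        raise ValueError("need exactly one prediction per item")
    correct = sum(pred == it.answer_index for it, pred in zip(items, predictions))
    return correct / len(items)


# Placeholder records only; these are not taken from ReDis-QA.
demo_items = [
    QAItem("Placeholder question 1", ["A", "B", "C", "D"], 0, "Disease A", "symptoms"),
    QAItem("Placeholder question 2", ["A", "B", "C", "D"], 2, "Disease A", "diagnosis"),
    QAItem("Placeholder question 3", ["A", "B", "C", "D"], 1, "Disease B", "symptoms"),
]

symptom_items = select_subset(demo_items, prop="symptoms")
symptom_accuracy = accuracy(symptom_items, [0, 3])  # one right, one wrong
```

Scoring a model with and without retrieved context on the same subset would give the kind of per-property comparison the abstract alludes to.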

What problem does this paper attempt to address?

This paper sets out to evaluate and improve the performance of large language models (LLMs) in diagnosing rare diseases. Although LLMs perform well in general medicine, their ability to diagnose rare diseases is still in question. Specifically, the paper aims to:

1. **Evaluate the performance of LLMs in the diagnosis of rare diseases**:
   - A high-quality dataset (ReDis-QA) specifically for rare-disease Q&A was constructed, containing 1,360 question-answer pairs covering 205 rare diseases.
   - Based on the ReDis-QA dataset, several open-source LLMs were benchmarked, revealing the challenges these models face in diagnosing rare diseases.
2. **Explore methods to improve the effectiveness of LLMs in the diagnosis of rare diseases**:
   - The first rare-disease corpus (ReCOP) was collected and constructed. This corpus is derived from the National Organization for Rare Disorders (NORD) database and covers multiple aspects of each rare disease: overview, symptoms, causes, effects, related disorders, diagnosis, and standard therapies.
   - Through retrieval-augmented generation (RAG), the external knowledge provided by ReCOP is used to improve the diagnostic accuracy of LLMs.

### Main findings

- **Benchmarking results**: Experiments show that current open-source LLMs still face significant challenges in diagnosing rare diseases, and their accuracy is generally low.
- **The effect of ReCOP**: With retrieval-augmented generation over ReCOP, the accuracy of LLMs on ReDis-QA increases by an average of 8%, and the models produce more trustworthy answers and explanations that can be traced back to existing literature.

### Key points of the solution

1. **Constructing the ReDis-QA dataset**: To evaluate the performance of LLMs in diagnosing rare diseases, the authors collected and annotated 1,360 high-quality question-answer pairs. Each question is accompanied by metadata that allows subsets for a specific disease and property to be extracted.
2. **Developing the ReCOP corpus**: To address the deficiencies of LLMs in diagnosing rare diseases, the authors collected rare-disease reports from the NORD database and divided each report into multiple chunks, each corresponding to a different property of the disease.
3. **Applying retrieval-augmented generation (RAG)**: Combining ReCOP with existing retrieval algorithms (such as BM25 and MedCPT) improves the performance of LLMs in diagnosing rare diseases.

In conclusion, by constructing a dedicated dataset and corpus and applying retrieval-augmented generation, the paper evaluates and improves the ability of LLMs to diagnose rare diseases.
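
The retrieval step described above can be illustrated with a small, self-contained sketch: a corpus of property-level chunks is ranked with BM25, and the top chunk is placed in front of the question. This is a plain standard-library implementation of Okapi BM25 under stated assumptions, not the authors' pipeline. The chunk fields (`disease`, `property`, `text`), the placeholder chunk texts, the parameter values `k1=1.5` and `b=0.75`, and the prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25Index:
    """Minimal Okapi BM25 ranker over a list of text chunks."""

    def __init__(self, texts, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(tokenize(t)) for t in texts]
        self.lengths = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.lengths) / len(texts)
        doc_freq = Counter()
        for tf in self.term_freqs:
            doc_freq.update(tf.keys())
        n = len(texts)
        # The "+ 1" inside the log keeps every IDF value positive.
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, query, i):
        """BM25 score of chunk i for the query."""
        total = 0.0
        for term in tokenize(query):
            f = self.term_freqs[i].get(term, 0)
            if f == 0:
                continue
            length_norm = 1 - self.b + self.b * self.lengths[i] / self.avg_len
            total += self.idf[term] * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

    def top_k(self, query, k=1):
        """Indices of the k highest-scoring chunks, best first."""
        order = sorted(
            range(len(self.term_freqs)),
            key=lambda i: self.score(query, i),
            reverse=True,
        )
        return order[:k]


def build_prompt(question, retrieved_chunks):
    """Prepend retrieved chunks to the question (hypothetical prompt format)."""
    context = "\n".join(
        f"[{c['disease']} / {c['property']}] {c['text']}" for c in retrieved_chunks
    )
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Placeholder chunks only; these are not taken from ReCOP or NORD.
chunks = [
    {"disease": "Disease A", "property": "symptoms",
     "text": "Disease A symptoms include fatigue and joint pain."},
    {"disease": "Disease A", "property": "causes",
     "text": "Disease A is caused by a variant in a single gene."},
    {"disease": "Disease B", "property": "symptoms",
     "text": "Disease B symptoms include skin rash and fever."},
]

index = BM25Index([c["text"] for c in chunks])
question = "Which symptoms such as joint pain suggest Disease A?"
best = index.top_k(question, k=1)
prompt = build_prompt(question, [chunks[i] for i in best])
```

Because each chunk covers a single property of a single disease, the retrieved context stays narrow, which is the alignment between chunks and questions that the abstract describes. A dense retriever such as MedCPT could replace `BM25Index` behind the same `top_k` interface.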

Assessing and Enhancing Large Language Models in Rare Disease Question-answering

RareBench: Can LLMs Serve as Rare Diseases Specialists?

A Hybrid Framework with Large Language Models for Rare Disease Phenotyping

MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering

Large Language Models with Retrieval-Augmented Generation for Zero-Shot Disease Phenotyping

Enhancing Healthcare through Large Language Models: A Study on Medical Question Answering

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

Large Language Models for Disease Diagnosis: A Scoping Review

[Influence of the 4 thrombus stages on fibrinolysis using streptokinase].

Zebra-Llama: A Context-Aware Large Language Model for Democratizing Rare Disease Knowledge

Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools

Integrating UMLS Knowledge into Large Language Models for Medical Question Answering

Assessing DxGPT: Diagnosing Rare Diseases with Various Large Language Models

Large Language Model Benchmarks in Medical Tasks

On the role of the UMLS in supporting diagnosis generation proposed by Large Language Models

A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations

RJUA-MedDQA: A Multimodal Benchmark for Medical Document Question Answering and Clinical Reasoning

Evaluating large language models on medical, lay language, and self-reported descriptions of genetic conditions

Large Language Models in Healthcare: A Comprehensive Benchmark

Do Large Language Models have Shared Weaknesses in Medical Question Answering?