Abstract

Background: Diagnosing rare genetic disorders relies on precise phenotypic and genotypic analysis, with the Human Phenotype Ontology (HPO) providing a standardized language for capturing clinical phenotypes. Traditional HPO tools, such as Doc2HPO and ClinPhen, employ concept recognition to automate phenotype extraction but struggle with incomplete phenotype assignment, often requiring intensive manual review. While large language models (LLMs) hold promise for more context-driven phenotype extraction, they are prone to errors and hallucinations, making them less reliable without further refinement. We present RAG-HPO, a Python-based tool that leverages Retrieval-Augmented Generation (RAG) to improve LLM accuracy in HPO term assignment, overcoming the limitations of baseline models while avoiding the time- and resource-intensive process of fine-tuning. RAG-HPO integrates a dynamic vector database, allowing real-time retrieval and contextual matching.

Methods: The high-dimensional vector database used by RAG-HPO includes >54,000 phenotypic phrases mapped to HPO IDs, derived from the HPO database and supplemented with additional validated phrases. In the RAG-HPO workflow, an LLM first extracts phenotypic phrases from the clinical text. These phrases are matched by semantic similarity to entries in the vector database, and the best-matching terms are returned to the LLM as context for final HPO term assignment. A benchmarking dataset of 120 published case reports with 1,792 manually assigned HPO terms was developed, and the performance of RAG-HPO was measured against the existing published tools Doc2HPO, ClinPhen, and FastHPOCR.

Results: RAG-HPO, powered by Llama-3 70B and applied to the set of 120 case reports, achieved a mean precision of 0.84, recall of 0.78, and F1 score of 0.80, significantly surpassing conventional tools (p<0.00001). False positive HPO term identification occurred for 15.8% (256/1,624) of terms. Of these false positives, only 2.7% (7/256) were hallucinations and 33.6% (86/256) were unrelated terms; the remaining 63.7% (163/256) were relatives of the target term.

Conclusions: RAG-HPO is a user-friendly, adaptable tool designed for secure evaluation of clinical text, and it outperforms standard HPO-matching tools in precision, recall, and F1. Its enhanced precision and recall represent a substantial advancement in phenotypic analysis, accelerating the identification of genetic mechanisms underlying rare diseases and driving progress in genetic research and clinical genomics.
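
The retrieval step described in the Methods (matching extracted phrases to database entries by semantic similarity, then handing the best matches back to the LLM as context) can be illustrated with a minimal sketch. This is not RAG-HPO's actual code. The function names (`embed`, `retrieve_candidates`, `build_context`) are hypothetical, the six-entry dictionary is a toy stand-in for the >54,000-phrase vector database, and the bag-of-words `embed` is an assumption that replaces a real dense embedding model. The LLM calls for phrase extraction and final term assignment are omitted.

```python
import math
import re
from collections import Counter

# Toy stand-in for the phrase-to-HPO vector database (illustrative entries only).
PHRASE_TO_HPO = {
    "seizure": "HP:0001250",
    "epileptic seizures": "HP:0001250",
    "small head circumference": "HP:0000252",
    "microcephaly": "HP:0000252",
    "global developmental delay": "HP:0001263",
    "low muscle tone": "HP:0001252",
}


def embed(text):
    """Hypothetical embedding: word counts. A real system would use a dense model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_candidates(phrase, top_k=3):
    """Return up to top_k (database phrase, HPO ID, similarity) tuples, best first."""
    query = embed(phrase)
    scored = [(cosine(query, embed(p)), p, hpo) for p, hpo in PHRASE_TO_HPO.items()]
    scored.sort(key=lambda item: item[0], reverse=True)
    # Drop entries with no overlap so unrelated terms never reach the LLM context.
    return [(p, hpo, round(s, 3)) for s, p, hpo in scored[:top_k] if s > 0]


def build_context(phrases, top_k=3):
    """Format retrieved candidates as plain-text context for a later LLM prompt."""
    lines = []
    for phrase in phrases:
        for candidate, hpo, score in retrieve_candidates(phrase, top_k):
            lines.append(f"{phrase} -> {candidate} ({hpo}), similarity {score}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Phrases an LLM might have extracted from a case report (made up for this demo).
    print(build_context(["recurrent seizures", "small head", "low tone"]))
```

Restricting the final assignment to retrieved candidates is what limits hallucinated HPO IDs in a design like this: the LLM chooses among entries that exist in the database and does not have to recall identifiers from memory.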

Improving Automated Deep Phenotyping Through Large Language Models Using Retrieval Augmented Generation
