Abstract e13609

Background: Precision oncology has revolutionized cancer treatment by identifying molecular biomarkers to guide personalized care. The ever-growing body of medical literature presents a challenge for oncologists researching targeted therapies. Although recent studies have investigated large language models (LLMs) to streamline this process, LLMs' reliance on general rather than medical knowledge limits their clinical relevance and trustworthiness. To address these limitations, we developed a retrieval-augmented generation (RAG) system that integrates PubMed clinical studies, trial databases, and oncological guidelines with LLMs to support targeted treatment recommendations. The Molecular Tumor Board (MTB) at the Center of Personalized Medicine (ZPM TUM) guided and evaluated the treatment options proposed by the LLM to assess their applicability for clinical decision support.

Methods: We used 10 publicly accessible fictional patient cases covering 7 tumor types and 59 distinct molecular alterations. Our LLM system, MEREDITH (Medical Evidence Retrieval and Data Integration for Tailored Healthcare), consists of Google's Gemini Pro enhanced with RAG and Chain-of-Thought (CoT) prompting. To establish a benchmark, clinical experts at ZPM TUM manually annotated the cases. Informed by MTB expert feedback, we iteratively improved our LLM system from a draft system relying on PubMed-indexed data to an enhanced system that replicated the expert annotation process by incorporating oncology guidelines, drug availability, and trial databases (ClinicalTrials.gov, QuickQueck.de). ZPM TUM assessed the credibility and clinical relevance of both the manually annotated and the LLM-generated recommendations. Patient-level data on (likely) pathogenic molecular alterations and recommended treatment options were summarized using the median and interquartile range (IQR). Semantic similarity between LLM and clinician responses was assessed using the cosine similarity of text vector embeddings; a paired t-test was used to evaluate significance.

Results: The median number of (likely) pathogenic molecular alterations per patient was 2.5 (IQR: 2-3). ZPM TUM identified a median of 2 treatment options per patient (IQR: 1-3), while the enhanced LLM identified a median of 4 (IQR: 3-5). MEREDITH proposed multiple relevant treatment suggestions, including therapies based on preclinical studies and molecular interactions, for further assessment by the MTB. ZPM TUM prioritized the most suitable clinical option. The mean semantic textual similarity between LLM and clinician responses increased significantly from 0.69 in the draft system to 0.76 in the enhanced system (p < 0.001). Thus, feedback from ZPM TUM enhanced the model's ability to align its responses with clinician thought processes.

Conclusions: LLMs instructed with expert thought processes hold promise as a novel decision support tool for precision oncology.
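The statistical steps named in the Methods (median with IQR, cosine similarity of embeddings, and a paired t-test) can be illustrated with a minimal Python sketch. The abstract does not state which embedding model, quartile convention, or statistics software was used, so the function names, the "inclusive" quartile method, and the toy numbers below are all assumptions chosen for illustration; none of the values come from the study.

```python
import math
import statistics


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def median_iqr(values):
    """Return (median, (Q1, Q3)); the quartile method is an assumption."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, (q1, q3)


def paired_t_statistic(before, after):
    """t statistic and degrees of freedom for a paired t-test.

    The p-value would then be read from a t distribution with the
    returned degrees of freedom, e.g. via a statistics library.
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    sd_diff = statistics.stdev(diffs)
    if sd_diff == 0:
        raise ValueError("t statistic is undefined when all differences are equal")
    return statistics.mean(diffs) / (sd_diff / math.sqrt(n)), n - 1


if __name__ == "__main__":
    # Hypothetical per-case similarity scores, NOT data from the abstract.
    draft = [0.60, 0.70, 0.65, 0.75]
    enhanced = [0.70, 0.75, 0.75, 0.80]
    t_stat, dof = paired_t_statistic(draft, enhanced)
    print(f"t = {t_stat:.3f}, degrees of freedom = {dof}")
    print("median, IQR:", median_iqr([1, 2, 3, 4, 5]))
    print("cosine:", cosine_similarity([1.0, 1.0], [1.0, 0.0]))
```

In this sketch each case contributes one similarity score per system, and the test operates on the per-case differences, which is what makes it paired rather than a comparison of two independent samples.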

Large language models for precision oncology: Clinical decision support through expert-guided learning.
