Performance of a trained large language model to provide clinical trial recommendation in a head and neck cancer population.

Tony Hung, Gilad Kuperman, Eric Jeffrey Sherman, Alan Loh Ho, Winston Wong, Anuja Kriplani, Lara Dunn, James Vincent Fetten, Loren S. Michel, Shrujal S. Baxi, Chunhua Weng, David G. Pfister, Jun J. Mao
DOI: https://doi.org/10.1200/jco.2024.42.16_suppl.11081
IF: 45.3
2024-06-01
Journal of Clinical Oncology
Abstract: 11081. Background: Chatbots based on large language models (LLMs) have demonstrated the ability to answer oncology exam questions; however, LLMs used for medical decision support have not yet demonstrated suitable performance in oncology practice. We evaluated the performance of a trained LLM, GPT-4, in recommending appropriate clinical trials for a head and neck (HN) cancer population. Methods: In 2022, we developed an artificial intelligence-powered clinical trial management mobile app, LookUpTrials, and demonstrated promising user engagement among oncologists. Using the LookUpTrials database, we applied direct preference optimization to train GPT-4 as an in-app assistant to LookUpTrials. From Nov 7 to Dec 19, 2023, we collected consecutive new patient cases and their respective clinical trial recommendations from oncologists in the HN medical oncology service at Memorial Sloan Kettering Cancer Center. Cases were categorized by diagnosis, cancer stage, treatment setting, and physician recommendation on clinical trials. Trained GPT-4 was prompted using a semi-structured template: “Given patient with a , , , what are possible clinical trials?” Physician recommendations were compared with trained GPT-4 responses. We analyzed the performance of GPT-4 based on its response precision (positive predictive value), recall (sensitivity), and F1 score (harmonic mean of precision and recall). Results: We analyzed 178 patient cases, mean age 65.6 (SD 13.9), primarily male (75%) with local/locally advanced (68%) HN (61%), thyroid (16%), skin (9%), or salivary (8%) cancers. The majority were treated in the definitive setting with combined modality therapy (42%), and a modest proportion were treated under clinical trials (10%).
Overall, trained GPT-4 achieved moderate performance in matching physician clinical trial recommendations, with 63% precision and 100% recall (F1 score 0.77), narrowing a total list of 56 HN clinical trials to a range of 0-4 relevant trials per patient case (mean 1, SD 1.2). Comparatively, the performance of our trained GPT-4 exceeded the historical performance of untrained LLMs in providing oncology treatment recommendations by 4- to 20-fold (F1 score 0.04-0.19). Conclusions: This proof-of-concept study demonstrated that a trained LLM can achieve moderate performance in matching physician clinical trial recommendations in HN oncology. Our results suggest the potential of embedding trained LLMs into the oncology workflow to aid clinical trial search and accelerate clinical trial accrual. Future research is needed to optimize the precision of trained LLMs and to assess whether they may be a scalable solution to enhance the diversity and equity of clinical trial participation.
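The reported metrics can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not the authors' evaluation code: the function names and the trial identifiers (TRIAL-A and so on) are hypothetical, and the abstract does not state whether precision and recall were pooled across all cases or averaged per case. The sketch only confirms that the reported precision and recall are consistent with the reported F1 score and with the stated fold improvement.

```python
def f1_from(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(recommended, relevant):
    """Score one set of model suggestions against physician recommendations.

    recommended: trials suggested by the model (hypothetical identifiers)
    relevant:    trials recommended by the physician
    """
    recommended, relevant = set(recommended), set(relevant)
    true_positives = len(recommended & relevant)
    # Precision (positive predictive value): share of suggestions that match.
    precision = true_positives / len(recommended) if recommended else 0.0
    # Recall (sensitivity): share of physician-recommended trials recovered.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall, f1_from(precision, recall)


# Consistency check on the figures reported in the abstract.
reported_f1 = f1_from(0.63, 1.00)
print(round(reported_f1, 2))  # 0.77

# Fold improvement over the historical F1 range of 0.04 to 0.19.
fold_low = 0.77 / 0.19
fold_high = 0.77 / 0.04

# Hypothetical single case: the model suggests three trials, two of which match.
p, r, f = precision_recall_f1(
    {"TRIAL-A", "TRIAL-B", "TRIAL-C"}, {"TRIAL-A", "TRIAL-B"}
)
```

With 100% recall, the F1 score depends only on precision, which is why the abstract identifies precision as the quantity to optimize in future work.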
oncology