Potential Multidisciplinary Use of Large Language Models for Addressing Queries in Cardio-Oncology
Pengfei Li,Xuejuan Zhang,Erjia Zhu,Shijun Yu,Bin Sheng,Yih Chung Tham,Tien Yin Wong,Hongwei Ji
DOI: https://doi.org/10.1161/jaha.123.033584
IF: 6.106
2024-01-01
Journal of the American Heart Association
Abstract:
Open Access, Rapid Communication

Pengfei Li (https://orcid.org/0009-0007-0482-5316), Department of General Medicine, The Affiliated Hospital of Qingdao University, Qingdao, China
Xuejuan Zhang, Department of General Medicine, The Affiliated Hospital of Qingdao University, Qingdao, China
Erjia Zhu, Department of Thoracic Surgery, Shanghai Pulmonary Hospital, Tongji University School of Medicine, Shanghai, China
Shijun Yu, Department of Oncology, Shanghai East Hospital, Tongji University School of Medicine, Shanghai, China
Bin Sheng, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
Yih Chung Tham, Yong Loo Lin School of Medicine, National University of Singapore, Singapore City, Singapore; Singapore National Eye Center, Singapore Eye Research Institute, Singapore, Singapore; Duke-NUS Medical School, Singapore City, Singapore
Tien Yin Wong* (https://orcid.org/0000-0002-8448-1264), Singapore National Eye Center, Singapore Eye Research Institute, Singapore, Singapore; Duke-NUS Medical School, Singapore City, Singapore; Tsinghua Medicine, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, Beijing, China
Hongwei Ji* (https://orcid.org/0000-0003-3657-4666), Tsinghua Medicine, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, Beijing, China

Originally published 18 Mar 2024. https://doi.org/10.1161/JAHA.123.033584. Journal of the American Heart Association. 2024;0:e9417

At the crossroads of digital health and education, large language models (LLMs) are emerging as tools with great potential.1 Trained on expansive textual data sets, these state-of-the-art artificial intelligence models can generate multidisciplinary content, answer intricate queries, and accelerate information delivery.1 Particularly in cardio-oncology, which combines cardiac and oncological expertise, LLMs have the potential to provide valuable insights to specialists such as cardiologists and oncologists.2 This is useful when standard guidelines are not immediately available or when a vast amount of interdisciplinary information must be combined. However, the performance of LLMs in this context remains largely unknown.
This study aims to benchmark these state-of-the-art artificial intelligence models in their ability to handle the interdisciplinary queries inherent in cardio-oncology, where integrative insights from cardiology and oncology are crucial.

The data that support the findings of this study are available from the last author upon reasonable request.

Our study, conducted between October 2, 2023, and October 12, 2023, compiled 25 questions according to the 2022 European Society of Cardiology guideline on cardio-oncology3 (Table). Each query was individually and independently posed to 5 LLMs: ChatGPT-3.5, ChatGPT-4.0, Bard, Llama 2, and Claude 2, generating a total of 25 responses per chatbot. All generated responses were formatted as plain text and stripped of any identifying details (eg, remarks like "I'm not a doctor" from ChatGPT). Responses were randomly shuffled within their respective question sets, ensuring that the reviewers remained unaware of which LLM had produced each response. Two experienced attending-level physicians independently assessed the responses in 5 separate rounds, each conducted on a distinct day, with an overnight washout period to minimize memory bias (Table). This study did not involve human subjects; institutional review board approval and informed consent were waived.

Table. Performance of Large Language Models in Addressing Patient Queries Regarding Cardio-Oncology

| Measure | GPT-3.5 | GPT-4 | Bard | Llama 2 | Claude 2 | P value |
| --- | --- | --- | --- | --- | --- | --- |
| Word count, mean±SD | 386±91 | 386±96 | 340±78 | 360±96 | 203±27 | <0.001 |
| Good response, n (%) | 13 (52) | 17 (68) | 13 (52) | 12 (48) | 13 (52) | 0.653 |

| Question | Good responses |
| --- | --- |
| 1. Definition | |
| What is cardio-oncology? | √√√√√ |
| What is cancer therapy–related cardiovascular toxicity? | √√√√ |
| What are the cardiovascular risk factors for cancer survivors? | √√√√√ |
| What is onco-hypertension? | √√ |
| Why is cardiovascular risk stratification before cancer surgery important? | √√√√√ |
| 2. Diagnosis | |
| What cardiac biomarkers are commonly used for the diagnosis of cancer therapy–related cardiovascular toxicity? | √√√√ |
| How to diagnose immune checkpoint inhibitor–associated myocarditis? | √√√√ |
| How to diagnose cancer therapy–related cardiac dysfunction? | √√√√√ |
| What is the classical noninvasive method used to diagnose amyloid light-chain cardiac amyloidosis? | √√ |
| How to diagnose cancer-related Takotsubo syndrome? | √√√√ |
| 3. Treatment | |
| What are the most common anticancer drugs that induce cardiovascular toxicity? | √√√√ |
| What is the recommended threshold for asymptomatic hypertension treatment for cancer survivors? | none |
| How to restart QTc-prolonging cancer therapy? | √ |
| Should cancer treatment be interrupted for patients with acute coronary syndrome? | none |
| How to treat venous thromboembolism for patients with cancer? | √√ |
| 4. Prevention | |
| How to prevent cancer therapy–related cardiovascular toxicity? | √ |
| How to prevent immune checkpoint inhibitor–associated myocarditis? | none |
| How to prevent cancer therapy–related cardiac dysfunction? | none |
| How to prevent cancer-related Takotsubo syndrome? | √√ |
| How to prevent hypertension for cancer patients? | √√√√√ |
| 5. Special population | |
| Are elderly patients at higher or lower risk of cancer therapy–related cardiovascular toxicity? | √√√√ |
| Is female sex a risk factor or protective factor for cancer therapy–related cardiovascular toxicity? Why? | √√ |
| How to start chemotherapy for pregnant patients with cancer? | √√√ |
| What is the cardiovascular risk for patients with multiple cancers? | √√√ |
| What is the difference between management strategies for patients with benign and malignant cardiac tumors? | √ |

Responses were categorized as good, borderline, or poor on the basis of clinical accuracy, relevance to the query, and adherence to the 2022 European Society of Cardiology guideline on cardio-oncology. Specifically, responses were designated as good if they were error free, borderline if they contained potential factual inaccuracies, and poor if they were inaccurate and contained factual inaccuracies. The proportions of good responses were compared using the χ2 test. The final grade assigned to each LLM's response was determined through a consensus approach by 2 attending-level physicians. A consultant-level cardiologist determined the final grade when responses had different grades from the 2 physicians. √ indicates a good response from 1 of the 5 LLMs; the marks are listed here without their model columns.

The mean±SD word count was 386±91 for ChatGPT-3.5, 386±96 for ChatGPT-4.0, 340±78 for Google Bard, 360±96 for Meta Llama 2, and 203±27 for Anthropic Claude 2 (P<0.001). The preliminary results indicated that ChatGPT-4.0 provided 17 of 25 (68%) good responses, followed by Bard, Claude 2, and ChatGPT-3.5 with 13 of 25 (52%) each, and Llama 2 with 12 of 25 (48%; P=0.653). A notable area of concern was the treatment and prevention domains, in which all 5 LLM chatbots earned either borderline or poor scores on several questions (Table). One example is the failure of the LLM chatbots to align with the latest guideline: in response to the question "Should cancer treatment be interrupted for patients with acute coronary syndrome?", the LLM chatbots suggested that treatment interruption should depend on the severity of acute coronary syndrome. However, the 2022 European Society of Cardiology guideline suggests that in a patient with both cancer and acute coronary syndrome, cancer treatment should be temporarily interrupted.3

Among the 5 evaluated LLM chatbots, although substantial room for improvement remained, ChatGPT-4.0 provided 68% good responses to queries related to cardio-oncology, exceeding the other LLMs. Our study is 1 of the few pioneering benchmarking studies of LLMs. While most previous studies focused primarily on ChatGPT-3.5 within a specific discipline,4 our study comprehensively examined 5 state-of-the-art LLMs at the intersection of oncology and cardiology.
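The χ2 comparison of good-response proportions reported in the results (13, 17, 13, 12, and 13 good responses out of 25 questions per model) can be checked with a short calculation. The sketch below is illustrative only: the article does not state which software or which variant of the test was used, so it assumes a plain Pearson χ2 test on the 5 × 2 table of good versus not-good responses, with no continuity correction. The function names are hypothetical, and the closed-form tail probability applies only to even degrees of freedom.

```python
import math

# Good-response counts out of 25 questions per model, as reported in the Table.
good = {"ChatGPT-3.5": 13, "ChatGPT-4.0": 17, "Bard": 13, "Llama 2": 12, "Claude 2": 13}
N_QUESTIONS = 25


def chi_square_homogeneity(good_counts, n_per_group):
    """Pearson chi-square statistic and degrees of freedom for a k x 2 table
    (good vs not good), assuming every group has the same number of questions."""
    k = len(good_counts)
    total_good = sum(good_counts)
    total = k * n_per_group
    # Under the null hypothesis every model shares the pooled good-response rate.
    expected_good = n_per_group * total_good / total
    expected_other = n_per_group - expected_good
    stat = 0.0
    for g in good_counts:
        stat += (g - expected_good) ** 2 / expected_good
        stat += ((n_per_group - g) - expected_other) ** 2 / expected_other
    return stat, k - 1


def chi_square_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (closed form, even df only)."""
    if df % 2 != 0:
        raise ValueError("this closed form requires even degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(df // 2))


stat, df = chi_square_homogeneity(list(good.values()), N_QUESTIONS)
p_value = chi_square_sf_even_df(stat, df)
print(f"chi2 = {stat:.3f}, df = {df}, P = {p_value:.3f}")
# prints: chi2 = 2.451, df = 4, P = 0.653
```

Under these assumptions the calculation gives P = 0.653, which agrees with the reported value. The difference between 68% and 48% to 52% is therefore a numerical one on 25 questions per model, not a statistically significant one.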
The observed superior performance of ChatGPT-4.0 may be due to its very large parameter set and to continuous feedback from a large number of users and experts that informs its training and reasoning.5 The LLMs performed better in the definition and diagnosis domains than in the treatment and prevention domains. This finding may be due to outdated training data sets, which are potentially misaligned with the latest progress and guidelines in cardio-oncology management, and it underscores the necessity for continuous LLM updates. Our results not only suggest the potential of LLMs as supportive tools in clinical decision making but also emphasize the ongoing need for oversight by human physicians. Looking ahead, the potential use of LLMs in cardio-oncology may represent a step toward a more data-informed and precision-oriented approach in both research and clinical practice.

Our study benefited from a rigorous study design, including proper randomization, washout periods, and a consensus scoring methodology. However, there were limitations. First, LLMs may generate false information (eg, hallucination), which highlights the need for risk-mitigation strategies such as encouraging users to consult multiple sources for comprehensive information. Second, although we curated questions on the basis of guidelines and the clinical experience of specialized physicians, these questions represent only a small part of real-world queries. Furthermore, given the rapid evolution of the LLM domain and the dynamic nature of lifelong learning, subsequent evaluations addressing the time-sensitive attributes of these models are imperative.

In conclusion, among 5 leading LLMs responding to a range of cardio-oncology queries, ChatGPT-4.0 exceeded the other LLMs, with 68% of responses graded as good, shedding light on both the capabilities and the limitations of LLMs at the intersection of these complex medical fields.
Ongoing updates and, where possible, fine-tuning of systems to align with the latest advancements in cardio-oncology management are needed.

Sources of Funding
This study was funded in part by the National Key Research and Development Program of China (2022YFC2502800), the National Natural Science Foundation of China (82103908), the Shandong Provincial Natural Science Foundation (ZR2021QH014), the Shuimu Scholar Program of Tsinghua University, and the National Postdoctoral Innovative Talent Support Program (BX20230189). The funding sources had no role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

Disclosures
None.

Footnotes
* Correspondence to: Tien Yin Wong and Hongwei Ji, Tsinghua Medicine, Beijing Tsinghua Changgung Hospital, Tsinghua University, Haidian District No. 30 Shuangqing Road, Beijing, 100084, China. Email: wongtienyin@tsinghua.edu.cn, hongweijicn@gmail.com
* P. Li and X. Zhang contributed equally.
This manuscript was sent to Tochukwu M. Okwuosa, DO, Associate Editor, for review by expert referees, editorial decision, and final disposition.

References
1 Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023; 29:1930–1940. doi: 10.1038/s41591-023-02448-8
2 Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, et al. Large language models encode clinical knowledge. Nature. 2023; 620:172–180. doi: 10.1038/s41586-023-06291-2
3 Lyon AR, López-Fernández T, Couch LS, Asteggiano R, Aznar MC, Bergler-Klein J, Boriani G, Cardinale D, Cordoba R, Cosyns B, et al; ESC Scientific Document Group. 2022 ESC Guidelines on cardio-oncology developed in collaboration with the European Hematology Association (EHA), the European Society for Therapeutic Radiology and Oncology (ESTRO) and the International Cardio-Oncology Society (IC-OS). Eur Heart J. 2022; 43:4229–4361. doi: 10.1093/eurheartj/ehac244
4 Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model. JAMA. 2023; 329:842–844. doi: 10.1001/jama.2023.1044
5 OpenAI. GPT-4 technical report. arXiv. 2023:2303.08774.

Copyright © 2024 The Authors. Published on behalf of the American Heart Association, Inc., by Wiley Blackwell. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
PMID: 38497458
Manuscript received November 16, 2023. Manuscript accepted February 14, 2024. Originally published March 18, 2024.
Keywords: cardio-oncology, ChatGPT, Claude 2, Google Bard, large language models
Subjects: Digital Health