Abstract

Background: Artificial intelligence is increasingly being applied to many workflows. Publicly accessible large language models (LLMs) are trained to understand, interact with, and produce human-readable text; their ability to deliver relevant and reliable information is of particular interest to health care providers and patients. Hematopoietic stem cell transplantation (HSCT) is a complex medical field that requires extensive knowledge, background, and training to practice successfully and can be challenging for a nonspecialist audience to comprehend.

Objective: We aimed to test the applicability of 3 prominent LLMs, namely ChatGPT-3.5 (OpenAI), ChatGPT-4 (OpenAI), and Bard (Google AI), in guiding nonspecialist health care professionals and advising patients seeking information regarding HSCT.

Methods: We submitted 72 open-ended HSCT-related questions of variable difficulty to the LLMs and rated their responses based on consistency (defined as replicability of the response), response veracity, language comprehensibility, specificity to the topic, and the presence of hallucinations. We then rechallenged the 2 best-performing chatbots by resubmitting the most difficult questions and prompting them to respond as if communicating with either a health care professional or a patient and to provide verifiable sources of information. Responses were then rerated with the additional criterion of language appropriateness, defined as language adaptation for the intended audience.

Results: ChatGPT-4 outperformed both ChatGPT-3.5 and Bard in terms of response consistency (66/72, 92%; 54/72, 75%; and 63/69, 91%, respectively; P=.007), response veracity (58/66, 88%; 40/54, 74%; and 16/63, 25%, respectively; P<.001), and specificity to the topic (60/66, 91%; 43/54, 80%; and 27/63, 43%, respectively; P<.001). Both ChatGPT-4 and ChatGPT-3.5 outperformed Bard in terms of language comprehensibility (64/66, 97%; 53/54, 98%; and 52/63, 83%, respectively; P=.002). All 3 LLMs displayed episodes of hallucination. ChatGPT-3.5 and ChatGPT-4 were then rechallenged with a prompt to adapt their language to the audience and to provide sources of information, and their responses were rated. ChatGPT-3.5 showed a better ability to adapt its language to a nonmedical audience than ChatGPT-4 (17/21, 81% and 10/22, 46%, respectively; P=.03); however, both failed to consistently provide correct and up-to-date information resources, reporting out-of-date materials, incorrect URLs, or unfocused references, which made their output unverifiable by the reader.

Conclusions: Despite the potential of LLMs in confronting challenging medical topics such as HSCT, the presence of mistakes and the lack of clear references make them not yet appropriate for routine, unsupervised clinical use or patient counseling. Enabling LLMs to access and reference current websites and research papers, as well as developing LLMs trained on specialized domain knowledge data sets, may offer potential solutions for their future clinical application.
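The proportions reported in the Results can be recomputed from the stated counts. The following is a minimal sketch for readers who want to check the arithmetic; it is not the authors' analysis code, and the names `REPORTED`, `MODELS`, `percent`, and `summarize` are hypothetical. The counts are copied from the abstract. The denominators for veracity, specificity, and comprehensibility equal the numbers of consistent responses, which suggests (as an assumption, not a statement from the source) that only consistent responses were rated on those criteria.

```python
# Counts copied from the abstract, in the order ChatGPT-4, ChatGPT-3.5, Bard.
# Each pair is (responses meeting the criterion, responses rated).
MODELS = ["ChatGPT-4", "ChatGPT-3.5", "Bard"]

REPORTED = {
    "consistency": [(66, 72), (54, 72), (63, 69)],
    "veracity": [(58, 66), (40, 54), (16, 63)],
    "specificity": [(60, 66), (43, 54), (27, 63)],
    "comprehensibility": [(64, 66), (53, 54), (52, 63)],
}


def percent(numerator, denominator):
    """Return the proportion as a whole-number percentage."""
    return round(100 * numerator / denominator)


def summarize(reported):
    """Map each criterion to a {model: percentage} dictionary."""
    return {
        criterion: {
            model: percent(n, d) for model, (n, d) in zip(MODELS, pairs)
        }
        for criterion, pairs in reported.items()
    }


if __name__ == "__main__":
    for criterion, by_model in summarize(REPORTED).items():
        print(criterion, by_model)
```

The abstract reports P values but does not name the statistical test, so this sketch deliberately stops at the percentages.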

ChatGPT as a medical education resource in cardiology: Mitigating replicability challenges and optimizing model performance

Performance of ChatGPT on the MCAT: The Road to Personalized and Equitable Premedical Learning

ChatGPT as a Tool for Medical Education and Clinical Decision-Making on the Wards: Case Study

ChatGPT and large language models (LLMs) awareness and use. A prospective cross-sectional survey of U.S. medical students

How Well Does ChatGPT Do When Taking the Medical Licensing Exams? The Implications of Large Language Models for Medical Education and Knowledge Assessment

Analyzing the Performance of ChatGPT in Cardiology and Vascular Pathologies

Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review

Evaluation of ChatGPT-generated medical responses: A systematic review and meta-analysis

The Role of Large Language Models in Medical Education: Applications and Implications

ChatGPT vs Medical Professional: Analyzing Responses to Laboratory Medicine Questions on Social Media

ChatGPT and Beyond: An overview of the growing field of large language models and their use in ophthalmology

Evaluation of ChatGPT as a diagnostic tool for medical learners and clinicians

Evaluating the Appropriateness, Consistency, and Readability of ChatGPT in Critical Care Recommendations

Effectiveness of ChatGPT in explaining complex medical reports to patients

Utility of Large Language Models for Health Care Professionals and Patients in Navigating Hematopoietic Stem Cell Transplantation: Comparison of the Performance of ChatGPT-3.5, ChatGPT-4, and Bard

The potential and pitfalls of using a large language model such as ChatGPT, GPT-4, or LLaMA as a clinical assistant.

Current Status of ChatGPT Use in Medical Education: Potentials, Challenges, and Strategies

Opportunities and Challenges for ChatGPT and Large Language Models in Biomedicine and Health

Can we use ChatGPT for Mental Health and Substance Use Education? Examining Its Quality and Potential Harms

Embracing ChatGPT for Medical Education: Exploring Its Impact on Doctors and Medical Students

Performance of ChatGPT on USMLE: Unlocking the Potential of Large Language Models for AI-Assisted Medical Education