Abstract:Background: Artificial intelligence is increasingly being applied to many workflows. Large language models (LLMs) are publicly accessible platforms trained to understand, interact with, and produce human-readable text; their ability to deliver relevant and reliable information is also of particular interest for the health care providers and the patients. Hematopoietic stem cell transplantation (HSCT) is a complex medical field requiring extensive knowledge, background, and training to practice successfully and can be challenging for the nonspecialist audience to comprehend. Objective: We aimed to test the applicability of 3 prominent LLMs, namely ChatGPT-3.5 (OpenAI), ChatGPT-4 (OpenAI), and Bard (Google AI), in guiding nonspecialist health care professionals and advising patients seeking information regarding HSCT. Methods: We submitted 72 open-ended HSCT-related questions of variable difficulty to the LLMs and rated their responses based on consistency-defined as replicability of the response-response veracity, language comprehensibility, specificity to the topic, and the presence of hallucinations. We then rechallenged the 2 best performing chatbots by resubmitting the most difficult questions and prompting to respond as if communicating with either a health care professional or a patient and to provide verifiable sources of information. Responses were then rerated with the additional criterion of language appropriateness, defined as language adaptation for the intended audience. Results: ChatGPT-4 outperformed both ChatGPT-3.5 and Bard in terms of response consistency (66/72, 92%; 54/72, 75%; and 63/69, 91%, respectively; P=.007), response veracity (58/66, 88%; 40/54, 74%; and 16/63, 25%, respectively; P<.001), and specificity to the topic (60/66, 91%; 43/54, 80%; and 27/63, 43%, respectively; P<.001). Both ChatGPT-4 and ChatGPT-3.5 outperformed Bard in terms of language comprehensibility (64/66, 97%; 53/54, 98%; and 52/63, 83%, respectively; P=.002). All displayed episodes of hallucinations. ChatGPT-3.5 and ChatGPT-4 were then rechallenged with a prompt to adapt their language to the audience and to provide source of information, and responses were rated. ChatGPT-3.5 showed better ability to adapt its language to nonmedical audience than ChatGPT-4 (17/21, 81% and 10/22, 46%, respectively; P=.03); however, both failed to consistently provide correct and up-to-date information resources, reporting either out-of-date materials, incorrect URLs, or unfocused references, making their output not verifiable by the reader. Conclusions: In conclusion, despite LLMs' potential capability in confronting challenging medical topics such as HSCT, the presence of mistakes and lack of clear references make them not yet appropriate for routine, unsupervised clinical use, or patient counseling. Implementation of LLMs' ability to access and to reference current and updated websites and research papers, as well as development of LLMs trained in specialized domain knowledge data sets, may offer potential solutions for their future clinical application.

Accuracy and Completeness of Bard and Chat-GPT 4 Responses for Questions Derived from the International Consensus Statement on Endoscopic Skull-Base Surgery 2019

Analyzing Large Language Models’ Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard

Utility of Large Language Models for Health Care Professionals and Patients in Navigating Hematopoietic Stem Cell Transplantation: Comparison of the Performance of ChatGPT-3.5, ChatGPT-4, and Bard

Comparing the Efficacy of Large Language Models ChatGPT, Bard, and Bing AI in Providing Information on Rhinoplasty: An Observational Study

Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study

Comparative Performance of Current Patient-Accessible Artificial Intelligence Large Language Models in the Preoperative Education of Patients in Facial Aesthetic Surgery

A Comparative Analysis of ChatGPT and Google’s AI’s “Bard” in Medicine

Assessing ChatGPT's Responses to Otolaryngology Patient Questions

Can ChatGPT help patients answer their otolaryngology questions?

Utility and Comparative Performance of Current Artificial Intelligence Large Language Models as Postoperative Medical Support Chatbots in Aesthetic Surgery

Harnessing Artificial Intelligence in Bariatric Surgery: Comparative Analysis of ChatGPT-4, Bing, and Bard in Generating Clinician-Level Bariatric Surgery Recommendations

Large language models encode medical oncology knowledge: Performance on the ASCO and ESMO examination questions.

Comparative Performance of ChatGPT and Bard in a Text-Based Radiology Knowledge Assessment

Comparative Performance of ChatGPT 3.5 and GPT4 on Rhinology Standardized Board Examination Questions

Artificial Intelligence for Anesthesiology Board–Style Examination Questions: Role of Large Language Models

Comparing ChatGPT and a Single Anesthesiologist's Responses to Common Patient Questions: An Exploratory Cross-Sectional Survey of a Panel of Anesthesiologists

To trust or not to trust: evaluating the reliability and safety of AI responses to laryngeal cancer queries

Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank

Evaluating the Current Ability of ChatGPT to Assist in Professional Otolaryngology Education

Google Bard and ChatGPT in Orthopedics: Which Is the Better Doctor in Sports Medicine and Pediatric Orthopedics? The Role of AI in Patient Education

Large language models in pathology: A comparative study of ChatGPT and bard with pathology trainees on multiple-choice questions