Abstract:Importance: Multimodal artificial intelligence (AI) chatbots can process complex medical image and text-based information that may improve their accuracy as a clinical diagnostic and management tool compared with unimodal, text-only AI chatbots. However, the difference in medical accuracy of multimodal and text-only chatbots in addressing questions about clinical oncology cases remains to be tested. Objective: To evaluate the utility of prompt engineering (zero-shot chain-of-thought) and compare the competency of multimodal and unimodal AI chatbots to generate medically accurate responses to questions about clinical oncology cases. Design, setting, and participants: This cross-sectional study benchmarked the medical accuracy of multiple-choice and free-text responses generated by AI chatbots in response to 79 questions about clinical oncology cases with images. Exposures: A unique set of 79 clinical oncology cases from JAMA Network Learning accessed on April 2, 2024, was posed to 10 AI chatbots. Main outcomes and measures: The primary outcome was medical accuracy evaluated by the number of correct responses by each AI chatbot. Multiple-choice responses were marked as correct based on the ground-truth, correct answer. Free-text responses were rated by a team of oncology specialists in duplicate and marked as correct based on consensus or resolved by a review of a third oncology specialist. Results: This study evaluated 10 chatbots, including 3 multimodal and 7 unimodal chatbots. On the multiple-choice evaluation, the top-performing chatbot was chatbot 10 (57 of 79 [72.15%]), followed by the multimodal chatbot 2 (56 of 79 [70.89%]) and chatbot 5 (54 of 79 [68.35%]). On the free-text evaluation, the top-performing chatbots were chatbot 5, chatbot 7, and the multimodal chatbot 2 (30 of 79 [37.97%]), followed by chatbot 10 (29 of 79 [36.71%]) and chatbot 8 and the multimodal chatbot 3 (25 of 79 [31.65%]). The accuracy of multimodal chatbots decreased when tested on cases with multiple images compared with questions with single images. Nine out of 10 chatbots, including all 3 multimodal chatbots, demonstrated decreased accuracy of their free-text responses compared with multiple-choice responses to questions about cancer cases. Conclusions and relevance: In this cross-sectional study of chatbot accuracy tested on clinical oncology cases, multimodal chatbots were not consistently more accurate than unimodal chatbots. These results suggest that further research is required to optimize multimodal chatbots to make more use of information from images to improve oncology-specific medical accuracy and reliability.

Comparative accuracy of artificial intelligence chatbots in pulpal and periradicular diagnosis: A cross-sectional study

A Comparative Analysis of Responses of Artificial Intelligence Chatbots in Special Needs Dentistry

Assessing the Accuracy, Completeness, and Reliability of Artificial Intelligence-Generated Responses in Dentistry: A Pilot Study Evaluating the ChatGPT Model

How well do large language model-based chatbots perform in oral and maxillofacial radiology?

Comparison of artificial intelligence large language model chatbots in answering frequently asked questions in anaesthesia

Embracing the future—is artificial intelligence already better? A comparative study of artificial intelligence performance in diagnostic accuracy and decision‐making

Exploring the role of ChatGPT in clinical decision-making in otorhinolaryngology: a ChatGPT designed study

Comparative analysis of diagnostic accuracy in endodontic assessments: dental students vs. artificial intelligence

Accuracy of an Artificial Intelligence Chatbot’s Interpretation of Clinical Ophthalmic Images

A Blinded Comparison of Three Generative Artificial Intelligence Chatbots for Orthopaedic Surgery Therapeutic Questions

Evaluating the Accuracy of ChatGPT and Google BARD in Fielding Oculoplastic Patient Queries: A Comparative Study on Artificial versus Human Intelligence

Artificial intelligence in dental education: ChatGPT's performance on the periodontic in‐service examination

Evaluation of validity and reliability of AI Chatbots as public sources of information on dental trauma

Assessing the performance of Bing Chat artificial intelligence: Dental exams, clinical guidelines, and patients' frequent questions

A comparative analysis of AI-based chatbots: Assessing data quality in orthognathic surgery related patient information

Innovating dental diagnostics: ChatGPT's accuracy on diagnostic challenges

Performance of AI Chatbots on Controversial Topics in Oral Medicine, Pathology, and Radiology

Performance of Multimodal Artificial Intelligence Chatbots Evaluated on Clinical Oncology Cases

Evaluation of AI-generated Responses by Different Artificial Intelligence Chatbots to the Clinical Decision-Making Case-Based Questions in Oral and Maxillofacial Surgery

Performance of Artificial Intelligence (AI)-Powered Chatbots in the Assessment of Medical Case Reports: Qualitative Insights From Simulated Scenarios

ChatGPT performance in prosthodontics: Assessment of accuracy and repeatability in answer generation