Comparing the Efficacy of Large Language Models ChatGPT, Bard, and Bing AI in Providing Information on Rhinoplasty: An Observational Study

Ishith Seth, Bryan Lim, Yi Xie, Jevan Cevik, Warren M Rozen, Richard J Ross, Mathew Lee
DOI: https://doi.org/10.1093/asjof/ojad084
2023-09-14
Aesthetic Surgery Journal Open Forum
Abstract:
Background: Large language models (LLMs) are an emerging artificial intelligence (AI) technology refining research and healthcare. The impact of these models on presurgical planning and education remains under-explored.
Objectives: This study aims to assess 3 prominent LLMs, namely Google's AI BARD (Mountain View, CA), Bing's AI (Microsoft; Redmond, WA), and ChatGPT-3.5 (OpenAI; San Francisco, CA), in providing safe medical information for rhinoplasty.
Methods: Six questions regarding rhinoplasty were posed to ChatGPT, BARD, and Bing AI. A panel of Specialist Plastic and Reconstructive Surgeons with extensive experience in rhinoplasty evaluated the responses on a Likert scale. Readability was measured with the Flesch Reading Ease Score, the Flesch-Kincaid Grade Level, and the Coleman-Liau Index. The modified DISCERN score was chosen as the criterion for assessing suitability and reliability. Student's t-test was used to compare the LLMs, and a two-sided P value < 0.05 was considered statistically significant.
Results: Readability-wise, BARD and ChatGPT demonstrated significantly (P < 0.05) greater Flesch Reading Ease Scores of 47.47 (±15.32) and 37.68 (±12.96), Flesch-Kincaid Grade Levels of 9.7 (±3.12) and 10.15 (±1.84), and Coleman-Liau Indices of 10.83 (±2.14) and 12.17 (±1.17), respectively, than Bing AI. Suitability-wise, BARD (46.3 ±2.8) demonstrated a significantly greater DISCERN score than ChatGPT and Bing AI. On the Likert scale, ChatGPT and BARD received similar scores, both greater than that of Bing AI.
Conclusions: BARD delivered the most succinct and comprehensible information, followed by ChatGPT and Bing AI. Although these models demonstrate potential, challenges remain regarding depth and specificity. Future research should aim to augment LLM performance through the integration of specialized databases and expert knowledge, while also refining their algorithms.
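
The three text indices named in the Methods have standard published formulas, and a small sketch may help readers interpret the reported values. The Python below is illustrative only: the abstract does not state which tool the authors used, the function names are hypothetical, and `naive_counts` relies on a crude vowel-group syllable heuristic that will differ from dedicated readability software.

```python
import re


def flesch_reading_ease(words, sentences, syllables):
    # Higher scores indicate text that is easier to read.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    # Approximate US school grade level needed to follow the text.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def coleman_liau_index(letters, words, sentences):
    # L = letters per 100 words, S = sentences per 100 words.
    L = letters / words * 100
    S = sentences / words * 100
    return 0.0588 * L - 0.296 * S - 15.8


def naive_counts(text):
    # Assumption: rough counting rules chosen for illustration only.
    # Text with no alphabetic words yields zero words and cannot be scored.
    words = re.findall(r"[A-Za-z]+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    letters = sum(len(w) for w in words)
    # Each run of vowels counts as one syllable, with at least one per word.
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words
    )
    return {
        "words": len(words),
        "sentences": sentences,
        "letters": letters,
        "syllables": syllables,
    }
```

For example, `c = naive_counts(response_text)` followed by `flesch_reading_ease(c["words"], c["sentences"], c["syllables"])` would score one model response, where `response_text` is a placeholder for the text being assessed. Note the opposite directions of the scales: a higher Flesch Reading Ease Score means easier text, whereas higher Flesch-Kincaid and Coleman-Liau values mean harder text.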