Abstract:Summary Background The increasing utilization of large language models (LLMs) in Generative Artificial Intelligence across various medical and dental fields, and specifically orthodontics, raises questions about their accuracy. Objective This study aimed to assess and compare the answers offered by four LLMs: Google’s Bard, OpenAI’s ChatGPT-3.5, and ChatGPT-4, and Microsoft’s Bing, in response to clinically relevant questions within the field of orthodontics. Materials and methods Ten open-type clinical orthodontics-related questions were posed to the LLMs. The responses provided by the LLMs were assessed on a scale ranging from 0 (minimum) to 10 (maximum) points, benchmarked against robust scientific evidence, including consensus statements and systematic reviews, using a predefined rubric. After a 4-week interval from the initial evaluation, the answers were reevaluated to gauge intra-evaluator reliability. Statistical comparisons were conducted on the scores using Friedman’s and Wilcoxon’s tests to identify the model providing the answers with the most comprehensiveness, scientific accuracy, clarity, and relevance. Results Overall, no statistically significant differences between the scores given by the two evaluators, on both scoring occasions, were detected, so an average score for every LLM was computed. The LLM answers scoring the highest, were those of Microsoft Bing Chat (average score = 7.1), followed by ChatGPT 4 (average score = 4.7), Google Bard (average score = 4.6), and finally ChatGPT 3.5 (average score 3.8). While Microsoft Bing Chat statistically outperformed ChatGPT-3.5 (P-value = 0.017) and Google Bard (P-value = 0.029), as well, and Chat GPT-4 outperformed Chat GPT-3.5 (P-value = 0.011), all models occasionally produced answers with a lack of comprehensiveness, scientific accuracy, clarity, and relevance. Limitations The questions asked were indicative and did not cover the entire field of orthodontics. Conclusions Language models (LLMs) show great potential in supporting evidence-based orthodontics. However, their current limitations pose a potential risk of making incorrect healthcare decisions if utilized without careful consideration. Consequently, these tools cannot serve as a substitute for the orthodontist’s essential critical thinking and comprehensive subject knowledge. For effective integration into practice, further research, clinical validation, and enhancements to the models are essential. Clinicians must be mindful of the limitations of LLMs, as their imprudent utilization could have adverse effects on patient care.

Comparing the Efficacy of Large Language Models ChatGPT, Bard, and Bing AI in Providing Information on Rhinoplasty: An Observational Study

Utility and Comparative Performance of Current Artificial Intelligence Large Language Models as Postoperative Medical Support Chatbots in Aesthetic Surgery

Comparative Performance of Current Patient-Accessible Artificial Intelligence Large Language Models in the Preoperative Education of Patients in Facial Aesthetic Surgery

Artificial Intelligence Versus Expert Plastic Surgeon: Comparative Study Shows ChatGPT "Wins" Rhinoplasty Consultations: Should We Be Worried?

Evaluation of Rhinoplasty Information from ChatGPT, Gemini, and Claude for Readability and Accuracy

Artificial Intelligence in Postoperative Care: Assessing Large Language Models for Patient Recommendations in Plastic Surgery

Harnessing Artificial Intelligence in Bariatric Surgery: Comparative Analysis of ChatGPT-4, Bing, and Bard in Generating Clinician-Level Bariatric Surgery Recommendations

Preoperative Patient Guidance and Education in Aesthetic Breast Plastic Surgery: A Novel Proposed Application of Artificial Intelligence Large Language Models

Large Language Models for Intraoperative Decision Support in Plastic Surgery: A Comparison between ChatGPT-4 and Gemini

Accuracy and Completeness of Bard and Chat-GPT 4 Responses for Questions Derived from the International Consensus Statement on Endoscopic Skull-Base Surgery 2019

Leveraging Large Language Models for Improved Patient Access and Self-Management: Assessor-Blinded Comparison Between Expert- and AI-Generated Content

Can Large Language Models Generate Outpatient Clinic Letters at First Consultation That Incorporate Complication Profiles From UK and USA Aesthetic Plastic Surgery Associations?

Aesthetic Surgery Advice and Counseling from Artificial Intelligence: A Rhinoplasty Consultation with ChatGPT

Comparative Performance of ChatGPT 3.5 and GPT4 on Rhinology Standardized Board Examination Questions

Evaluating the Adherence of Large Language Models to Surgical Guidelines: A Comparative Analysis of Chatbot Recommendations and North American Spine Society (NASS) Coverage Criteria

Evaluating Artificial Intelligence's Role in Teaching the Reporting and Interpretation of Computed Tomographic Angiography for Preoperative Planning of the Deep Inferior Epigastric Artery Perforator Flap

Analyzing Large Language Models’ Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard

Quantitative Comparison of Chatbots on Common Rhinology Pathologies

Clinical Accuracy, Relevance, Clarity, and Emotional Sensitivity of Large Language Models to Surgical Patient Questions: Cross-Sectional Study

Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing

Artificial Intelligence for Anesthesiology Board–Style Examination Questions: Role of Large Language Models