Abstract:

Background: ChatGPT, a publicly available artificial intelligence (AI) large language model, has made sophisticated AI technology available on demand, and its use has already begun to make its way into medical research. However, the medical community has yet to understand the capabilities and ethical considerations of AI within this context, and unknowns exist regarding ChatGPT's writing abilities, accuracy, and implications for authorship.

Objectives: We hypothesize that human reviewers and AI detection software differ in their ability to correctly identify original published abstracts and AI-written abstracts in the subjects of Gynecology and Urogynecology. We additionally suspect that concrete differences in writing errors, readability, and perceived writing quality exist between original and AI-generated text.

Study Design: Twenty-five articles published in high-impact medical journals and a collection of Gynecology and Urogynecology journals were selected. ChatGPT was prompted to write 25 corresponding AI-generated abstracts and was given the abstract title, journal-dictated abstract requirements, and select original results. The original and AI-generated abstracts were reviewed by blinded Gynecology and Urogynecology faculty and fellows, who identified each as original or AI-generated. All abstracts were analyzed by the publicly available AI detection software GPTZero, Originality, and Copyleaks and were assessed for writing errors and quality by the AI writing assistant Grammarly.

Results: One hundred fifty-seven reviews of 25 original and 25 AI-generated abstracts were conducted by 26 faculty and 4 fellows. Fifty-seven percent of original abstracts and 42.3% of AI-generated abstracts were correctly identified, for an average of 49.7% across all abstracts. All three AI detectors rated the original abstracts as less likely to be AI-written than the ChatGPT-generated abstracts (GPTZero 5.8 vs 73.3%, p<0.001; Originality 10.9 vs 98.1%, p<0.001; Copyleaks 18.6 vs 58.2%, p<0.001). The three AI detection tools differed in performance when analyzing all abstracts (p=0.03), original abstracts (p<0.001), and AI-generated abstracts (p<0.001). Grammarly text analysis identified more writing issues and correctness errors in original than in AI-generated abstracts, including a lower Grammarly score reflective of poorer writing quality (82.3 vs 88.1, p=0.006), more total writing issues (19.2 vs 12.8, p<0.001), critical issues (5.4 vs 1.3, p<0.001), confusing words (0.8 vs 0.1, p=0.006), misspelled words (1.7 vs 0.6, p=0.02), incorrect determiner use (1.2 vs 0.2, p=0.002), and comma misuse (0.3 vs 0.0, p=0.005).

Conclusions: Human reviewers are unable to detect the subtle differences between human and ChatGPT-generated scientific writing because of AI's ability to generate highly realistic text. AI detection software improves identification of AI-generated writing but still lacks complete accuracy and requires programmatic improvements to achieve optimal detection. As reviewers and editors may be unable to reliably detect AI-generated pieces, clear guidelines for reporting AI use by authors and for implementing AI detection software in the review process will need to be established as AI chatbots gain more widespread use.

Human versus artificial intelligence‐generated arthroplasty literature: A single‐blinded analysis of perceived communication, quality, and authorship source

Human-Written vs AI-Generated Texts in Orthopedic Academic Literature: Comparative Qualitative Analysis

Identification of ChatGPT-Generated Abstracts Within Shoulder and Elbow Surgery Poses a Challenge for Reviewers

Editorial Commentary: Experts in Shoulder Surgery Do Not Consistently Detect Artificial Intelligence-Generated Scientific Abstracts

Assessing the Reproducibility of the Structured Abstracts Generated by ChatGPT and Bard Compared to Human-Written Abstracts in the Field of Spine Surgery: Comparative Analysis

AI discernment in foot and ankle surgery research: A survey investigation

Rise of the Machines: The Prevalence and Disclosure of Artificial Intelligence–Generated Text in High-Impact Orthopaedic Journals

Reviewer Experience Detecting and Judging Human Versus Artificial Intelligence Content: The Stroke Journal Essay Contest

Human versus machine: identifying ChatGPT-generated abstracts in Gynecology and Urogynecology

Editorial Commentary: Biomedical Research Investigating Artificial Intelligence Large Language Models Needs to Move Beyond Measuring Accuracy and Focus on Improving Patient Care

Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers

Abstracts Written by Medical Researchers vs Generated by Large Language Models

An evaluation of AI generated literature reviews in musculoskeletal radiology

Artificial Intelligence-Generated Editorials in Radiology: Can Expert Editors Detect Them?

It Cannot Be Right If It Was Written by AI: On Lawyers' Preferences of Documents Perceived as Authored by an LLM vs a Human

Artificial Intelligence Large Language Models Address Anterior Cruciate Ligament Reconstruction: Superior Clarity and Completeness by Gemini Compared With ChatGPT-4 in Response to American Academy of Orthopaedic Surgeons Clinical Practice Guidelines

The great detectives: humans versus AI detectors in catching large language model-generated medical writing

The impact of text topic and assumed human vs. AI authorship on competence and quality assessment

Digital Ink and Surgical Dreams: Perceptions of Artificial Intelligence-Generated Essays in Residency Applications

From technical to understandable: Artificial Intelligence Large Language Models improve the readability of knee radiology reports

Residency Application Selection Committee Discriminatory Ability in Identifying Artificial Intelligence-Generated Personal Statements