Human versus artificial intelligence‐generated arthroplasty literature: A single‐blinded analysis of perceived communication, quality, and authorship source

Kyle W. Lawrence, Akram A. Habibi, Spencer A. Ward, Claudette M. Lajam, Ran Schwarzkopf, Joshua C. Rozell
DOI: https://doi.org/10.1002/rcs.2621
2024-02-14
International Journal of Medical Robotics and Computer Assisted Surgery
Abstract: Background: Large language models (LLMs) have unknown implications for medical research. This study assessed whether LLM‐generated abstracts are distinguishable from human‐written abstracts and compared their perceived quality. Methods: The LLM ChatGPT was used to generate 20 arthroplasty abstracts (AI‐generated) based on full‐text manuscripts, which were compared to the originally published abstracts (human‐written). Six blinded orthopaedic surgeons rated abstracts on overall quality, communication, and confidence in the authorship source. Authorship‐confidence scores were compared to a test value representing complete inability to discern authorship. Results: Modestly increased confidence in human authorship was observed for human‐written abstracts compared with AI‐generated abstracts (p = 0.028), though authorship‐confidence scores for AI‐generated abstracts were statistically consistent with inability to discern authorship (p = 0.999). Overall abstract quality was higher for human‐written abstracts (p = 0.019). Conclusions: Absolute authorship‐confidence ratings for AI‐generated abstracts indicated that raters had difficulty discerning authorship, but AI‐generated abstracts did not achieve the perceived quality of human‐written abstracts. Caution is warranted in implementing LLMs into scientific writing.
surgery