Abstract:

Background: The application of artificial intelligence (AI) in academic writing has raised concerns regarding accuracy, ethics, and scientific rigour. Some AI content detectors may not accurately identify AI-generated texts, especially those that have undergone paraphrasing. Therefore, there is a pressing need for efficacious approaches or guidelines to govern AI usage in specific disciplines.

Objective: Our study aims to compare the accuracy of mainstream AI content detectors and human reviewers in detecting AI-generated rehabilitation-related articles with or without paraphrasing.

Study design: This cross-sectional study purposively chose 50 rehabilitation-related articles from four peer-reviewed journals, and then fabricated another 50 articles using ChatGPT. Specifically, ChatGPT was used to generate the introduction, discussion, and conclusion sections based on the original titles, methods, and results. Wordtune was then used to rephrase the ChatGPT-generated articles. Six common AI content detectors (Originality.ai, Turnitin, ZeroGPT, GPTZero, Content at Scale, and GPT-2 Output Detector) were employed to identify AI content in the original, ChatGPT-generated, and AI-rephrased articles. Four human reviewers (two student reviewers and two professorial reviewers) were recruited to differentiate between the original articles and the AI-rephrased articles, which were expected to be more difficult to detect. They were instructed to give reasons for their judgements.

Results: Originality.ai correctly detected 100% of ChatGPT-generated and AI-rephrased texts. ZeroGPT accurately detected 96% of ChatGPT-generated and 88% of AI-rephrased articles. The area under the receiver operating characteristic curve (AUROC) of ZeroGPT was 0.98 for distinguishing between human-written and AI articles. Turnitin showed a 0% misclassification rate for human-written articles, although it only identified 30% of AI-rephrased articles. Professorial reviewers accurately discriminated at least 96% of AI-rephrased articles, but they misclassified 12% of human-written articles as AI-generated. On average, students only identified 76% of AI-rephrased articles. The reasons reviewers most often gave for identifying AI-rephrased articles were ‘incoherent content’ (34.36%), followed by ‘grammatical errors’ (20.26%) and ‘insufficient evidence’ (16.15%).

Conclusions and relevance: This study directly compared the accuracy of advanced AI detectors and human reviewers in detecting AI-generated medical writing after paraphrasing. Our findings demonstrate that specific detectors and experienced reviewers can accurately identify articles generated by Large Language Models, even after paraphrasing. The rationale employed by our reviewers in their assessments can inform future evaluation strategies for monitoring AI usage in medical education or publications. AI content detectors may be incorporated as an additional screening tool in the peer-review process of academic journals.
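
The three kinds of figures reported in the Results (the share of AI articles a detector or reviewer catches, the share of human-written articles wrongly flagged as AI, and the AUROC) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the 0.5 decision threshold, and the toy labels and scores are hypothetical and are not the study's data or the detectors' actual scoring rules.

```python
from typing import Sequence


def detection_rate(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of AI-written items (label 1) that were flagged as AI (prediction 1)."""
    flagged = [p for y, p in zip(labels, predictions) if y == 1]
    return sum(flagged) / len(flagged)


def misclassification_rate(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of human-written items (label 0) wrongly flagged as AI (prediction 1)."""
    flagged = [p for y, p in zip(labels, predictions) if y == 0]
    return sum(flagged) / len(flagged)


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a randomly chosen AI item receives a higher score
    than a randomly chosen human item; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 0 = human-written, 1 = AI-generated.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical detector scores ("probability of AI"), not taken from the study.
scores = [0.05, 0.20, 0.40, 0.65, 0.35, 0.70, 0.90, 0.95]
# Assumed decision threshold of 0.5; real detectors may use different cut-offs.
predictions = [1 if s >= 0.5 else 0 for s in scores]

print(detection_rate(labels, predictions))          # 0.75
print(misclassification_rate(labels, predictions))  # 0.25
print(auroc(labels, scores))                        # 0.875
```

The detection and misclassification rates depend on the chosen threshold, whereas the AUROC summarises how well the scores separate the two groups across all thresholds, which is one reason a study may report both kinds of figure.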

Gotcha GPT: Ensuring the Integrity in Academic Writing

Academics' perceptions of ChatGPT-generated written outputs: A practical application of Turing’s Imitation Game

Distinguishing academic science writing from humans or ChatGPT with over 99% accuracy using off-the-shelf machine learning tools

ChatGPT and the Future of Academic Integrity in the Artificial Intelligence Era: A New Frontier

From human writing to artificial intelligence generated text: examining the prospects and potential threats of ChatGPT in academic writing

Et Machina: Exploring the Use of Conversational Agents Such as ChatGPT in Scientific Writing

Classification of Human- and AI-Generated Texts: Investigating Features for ChatGPT

The imitation game: Detecting human and AI-generated texts in the era of ChatGPT and BARD

Academic integrity and artificial intelligence: is ChatGPT hype, hero or heresy?

The great detectives: humans versus AI detectors in catching large language model-generated medical writing

Can linguists distinguish between ChatGPT/AI and human writing?: A study of research ethics and academic publishing

ChatGPT or academic scientist? Distinguishing authorship with over 99% accuracy using off-the-shelf machine learning tools

ChatGPT versus human essayists: an exploration of the impact of artificial intelligence for authorship and academic integrity in the humanities

ChatGPT in Academic Writing: A Threat to Human Creativity and Academic Integrity? An Exploratory Study

Health literacy in ChatGPT: exploring the potential of the use of artificial intelligence to produce academic text

Enhancing English abstract quality for non-English speaking authors using ChatGPT: A comparative study of Taiwan, Japan, China, and South Korea with slope graphs

Quantitative Analysis of AI-Generated Texts in Academic Research: A Study of AI Presence in Arxiv Submissions using AI Detection Tool

Evaluating AI and Human Authorship Quality in Academic Writing through Physics Essays

Academic integrity in the age of Artificial Intelligence (AI) authoring apps

Detecting AI-generated essays: the ChatGPT challenge

Admissions in the age of AI: detecting AI-generated application materials in higher education