Accelerating the pace and accuracy of systematic reviews using AI: a validation study

Jiada Zhan, Kara Suvada, Muwu Xu, Wenya Tian, Kelly C Cara, Taylor C Wallace, Mohammed K. Ali
DOI: https://doi.org/10.1101/2024.12.10.24318803
2024-12-11
Abstract: Background: Artificial intelligence (AI) can greatly enhance efficiency in systematic literature reviews and meta-analyses, but its accuracy in screening titles/abstracts and full-text articles is uncertain. Objectives: This study evaluated the performance metrics (sensitivity, specificity) of a GPT-4 AI program, Review Copilot, against human decisions (gold standard) in screening titles/abstracts and full-text articles from four published systematic reviews/meta-analyses. Research Design: This validation study used participant data from four already-published systematic literature reviews, which included observational studies and randomized controlled trials. Review Copilot operates on the OpenAI GPT-4 server. Its decisions to include or exclude titles/abstracts and full-text articles were compared with the human decisions (gold standard) in the four reviews, using sensitivity, specificity, and balanced accuracy. Results: Review Copilot's sensitivity and specificity were 99.2% and 83.6%, respectively, for title/abstract screening and 97.6% and 47.4% for full-text screening. The average agreement between two runs was 95.4%, with a kappa statistic of 0.83. Review Copilot completed screening in one-quarter of the time humans required. Conclusions: AI use in systematic reviews and meta-analyses is inevitable. Health researchers must understand these technologies' strengths and limitations to ethically leverage them for research efficiency and evidence-based decision-making in health.
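For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of how sensitivity, specificity, balanced accuracy, and Cohen's kappa can be computed from paired include/exclude decisions. It is not the authors' code: the function names and the toy decision lists are hypothetical, and it assumes the standard definitions with human decisions treated as the gold standard.

```python
import math


def screening_metrics(human, ai):
    """Compare AI decisions against human decisions (gold standard).

    Both arguments are equal-length lists of booleans (True = include).
    """
    if len(human) != len(ai):
        raise ValueError("decision lists must have the same length")
    pairs = list(zip(human, ai))
    tp = sum(1 for h, a in pairs if h and a)          # both include
    tn = sum(1 for h, a in pairs if not h and not a)  # both exclude
    fp = sum(1 for h, a in pairs if not h and a)      # AI includes, human excludes
    fn = sum(1 for h, a in pairs if h and not a)      # AI excludes, human includes
    # A metric is undefined (NaN) when its reference class is empty.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def cohen_kappa(run1, run2):
    """Chance-corrected agreement between two runs of binary decisions."""
    if len(run1) != len(run2) or not run1:
        raise ValueError("runs must be non-empty and of equal length")
    n = len(run1)
    observed = sum(1 for a, b in zip(run1, run2) if a == b) / n
    p1 = sum(run1) / n  # include rate in run 1
    p2 = sum(run2) / n  # include rate in run 2
    expected = p1 * p2 + (1 - p1) * (1 - p2)
    if expected == 1:
        return math.nan  # both runs are constant; kappa is undefined
    return (observed - expected) / (1 - expected)


# Hypothetical toy data, not taken from the study.
human = [True, True, True, True, False, False, False, False, False, False]
ai_run1 = [True, True, True, False, True, False, False, False, False, False]
ai_run2 = [True, True, True, False, False, False, False, False, False, False]

metrics = screening_metrics(human, ai_run1)
kappa = cohen_kappa(ai_run1, ai_run2)
```

In a screening setting, sensitivity matters most because a wrongly excluded record is lost to the review, while low specificity mainly costs extra human checking downstream.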