Evaluating the Efficacy of Large Language Models for Systematic Review and Meta-Analysis Screening

Ronald Luo, Ziya Sastimoglu, Abu Ilius Faisal, M. Jamal Deen
DOI: https://doi.org/10.1101/2024.06.03.24308405
2024-06-04
Abstract:

Background: Systematic reviews and meta-analyses are essential for informed research and policymaking, yet they are typically resource-intensive and time-consuming. Recent advances in artificial intelligence and machine learning offer promising opportunities to streamline these processes.

Objective: To enhance the efficiency of systematic reviews, we explored the automation of various stages using GPT-3.5 Turbo. We assessed the model's efficacy and performance by comparing it against three expert-conducted reviews across a comprehensive dataset of 24,534 studies.

Methods: The model's performance was evaluated through a comparison with three expert reviews, utilizing a pseudo-K-folds permutation and a one-tailed ANOVA with an alpha level of 0.05 to ensure statistical validity. Key performance metrics such as accuracy, sensitivity, specificity, predictive values, F1-score, and the Matthews correlation coefficient were analyzed using two sets of prompts.

Results: Our approach significantly streamlined the systematic review process, which typically takes a year, reducing it to a few hours without sacrificing quality. In the initial screening phase, accuracy, specificity, and negative predictive values ranged between 80% and 95%. Sensitivity improved markedly during the second screening phase, demonstrating the model's robustness when provided with more extensive data.

Conclusion: While ongoing refinements are needed, this tool represents a significant advancement in research methodologies, potentially making systematic reviews more accessible to a wider range of researchers.
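The screening metrics named in the Methods can all be derived from a single confusion matrix of model decisions against expert decisions. The sketch below shows the standard definitions only; it is not the authors' evaluation code. The function name `screening_metrics` and the counts in the usage note are hypothetical and are not taken from the paper.

```python
import math


def screening_metrics(tp, fp, tn, fn):
    """Compute standard binary screening metrics from confusion-matrix counts.

    tp: studies included by both the model and the experts
    fp: studies included by the model but excluded by the experts
    tn: studies excluded by both
    fn: studies excluded by the model but included by the experts
    (Hypothetical helper for illustration, not the paper's code.)
    """

    def ratio(num, den):
        # Return NaN instead of raising when a denominator is zero.
        return num / den if den else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),  # recall on included studies
        "specificity": ratio(tn, tn + fp),  # recall on excluded studies
        "ppv": ratio(tp, tp + fp),          # positive predictive value
        "npv": ratio(tn, tn + fn),          # negative predictive value
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

For example, `screening_metrics(tp=40, fp=10, tn=45, fn=5)` uses made-up counts to show the calculation. Because screening datasets are usually dominated by excluded studies, accuracy and negative predictive value can look strong even when sensitivity is low, which is one reason to report the Matthews correlation coefficient alongside them.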