Leveraging large language models for systematic reviewing: A case study using HIV medication adherence research
M Naser Lessani, Zhenlong Li, Shan Qiao, Huan Ning, Abhishek Aggarwal, Guangzhe Frank Yuan, Atena Pasha, Michael Stirratt, Lori A. J. Scott-Sheldon
DOI: https://doi.org/10.1101/2024.09.18.24313828
2024-09-19
Abstract: Background: The rapidly accumulating scientific literature on HIV makes it difficult to assess the relevant literature accurately and efficiently. This study explores the potential of large language models (LLMs), such as ChatGPT, for selecting relevant studies for a systematic review.
Method: Scientific papers were initially obtained from bibliographic database searches using a Boolean search strategy with pre-defined keywords. From 15,839 unique records, three reviewers manually identified 39 relevant papers based on pre-specified inclusion and exclusion criteria. In the ChatGPT experiment, over 10% of records were randomly chosen as the experimental dataset, including the 39 manually identified manuscripts. These unique records (n=1,680) underwent screening via ChatGPT-4 using the same pre-specified criteria. Four strategies were employed: standard prompting, i.e., input-output (IO); chain of thought with zero-shot learning (0-CoT); CoT with few-shot learning (FS-CoT); and Majority Voting (which integrates all three prompting strategies). Performance of the models was assessed using recall, F-score, and precision measures.
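To make the Majority Voting step and the evaluation measures concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a record is included when more than half of the three prompting strategies include it (the abstract does not state the exact voting rule), and it uses small invented toy labels; the function names `majority_vote` and `precision_recall_f1` are hypothetical.

```python
from typing import Dict, List, Tuple


def majority_vote(votes: Dict[str, List[bool]]) -> List[bool]:
    """Combine per-strategy include/exclude decisions record by record.

    ASSUMPTION: a record is included when more than half of the
    strategies include it. The abstract does not specify the rule.
    """
    n_strategies = len(votes)
    return [sum(decisions) * 2 > n_strategies for decisions in zip(*votes.values())]


def precision_recall_f1(predicted: List[bool], actual: List[bool]) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 from boolean include labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy decisions for five records (invented for illustration, not study data).
votes = {
    "IO":     [True, False, True, False, True],
    "0-CoT":  [True, True, False, False, True],
    "FS-CoT": [True, True, True, False, False],
}
actual = [True, True, False, False, False]  # hypothetical reviewer labels

combined = majority_vote(votes)
precision, recall, f1 = precision_recall_f1(combined, actual)
print(combined, precision, recall, round(f1, 2))
```

In this toy case the combined vote recovers both relevant records while still flagging two irrelevant ones, which mirrors the high-recall, low-precision pattern a screening step is expected to show.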
Results: Recall scores (the proportion of relevant abstracts that the model successfully identified among all input records) were 0.82 (IO), 0.97 (0-CoT), and 1.0 for both FS-CoT and Majority Voting. F-scores were 0.34 (IO), 0.29 (0-CoT), 0.39 (FS-CoT), and 0.46 (Majority Voting). Precision measures were 0.22 (IO), 0.17 (0-CoT), 0.24 (FS-CoT), and 0.30 (Majority Voting). Computational time was 2.32, 4.55, 6.44, and 13.30 hours for IO, 0-CoT, FS-CoT, and Majority Voting, respectively. Processing costs for the 1,680 unique records were approximately $63, $73, $186, and $325, respectively.
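As a reading aid, the reported figures can be related to one another with a short calculation. The sketch below assumes the F-score is the balanced F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly, and derives per-record time and cost by dividing the reported totals by the 1,680 screened records. Because the inputs are rounded to two decimals, recomputed F1 values are only expected to agree with the reported ones to within about 0.01.

```python
# Figures as reported in the abstract; all derived values are approximations.
N_RECORDS = 1680

reported = {
    "IO":              {"precision": 0.22, "recall": 0.82, "f_score": 0.34, "hours": 2.32,  "cost_usd": 63},
    "0-CoT":           {"precision": 0.17, "recall": 0.97, "f_score": 0.29, "hours": 4.55,  "cost_usd": 73},
    "FS-CoT":          {"precision": 0.24, "recall": 1.00, "f_score": 0.39, "hours": 6.44,  "cost_usd": 186},
    "Majority Voting": {"precision": 0.30, "recall": 1.00, "f_score": 0.46, "hours": 13.30, "cost_usd": 325},
}


def f1(precision: float, recall: float) -> float:
    """Balanced F-score (ASSUMPTION: the abstract's F-score is F1)."""
    return 2 * precision * recall / (precision + recall)


summary = {}
for name, m in reported.items():
    summary[name] = {
        "f1_recomputed": f1(m["precision"], m["recall"]),
        "seconds_per_record": m["hours"] * 3600 / N_RECORDS,
        "usd_per_record": m["cost_usd"] / N_RECORDS,
    }
    s = summary[name]
    print(
        f"{name}: F1={s['f1_recomputed']:.3f} (reported {m['f_score']}), "
        f"{s['seconds_per_record']:.1f} s/record, ${s['usd_per_record']:.3f}/record"
    )
```

The per-record view makes the trade-off easier to compare across strategies: Majority Voting has the highest F-score but also the highest time and cost per record, since it runs all three prompts.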
Conclusion: LLMs such as ChatGPT are viable for systematic reviews, efficiently identifying studies that meet pre-specified criteria. Greater efficacy was observed with a more sophisticated prompt design that integrates the IO, 0-CoT, and FS-CoT prompting techniques (i.e., Majority Voting). LLMs can expedite the study selection process in systematic reviews compared to manual methods, with minimal cost implications.