Large language models for conducting systematic reviews: on the rise, but not yet ready for use - a scoping review

Judith-Lisa Lieberum, Markus Toews, Maria-Inti Metzendorf, Felix Heilmeyer, Waldemar Siemens, Christian Haverkamp, Daniel Boehringer, Joerg J Meerpohl, Angelika Eisele-Metzger
DOI: https://doi.org/10.1101/2024.12.19.24319326
2024-12-24
Abstract: Background: Machine learning (ML) promises versatile help in the creation of systematic reviews (SRs). Recently, further developments in the form of large language models (LLMs) and their application in SR conduct have attracted attention. Objective: To provide an overview of ML and specifically LLM applications in SR conduct in health research. Study design: We systematically searched MEDLINE, Web of Science, IEEEXplore, ACM Digital Library, Europe PMC (preprints), and Google Scholar, and conducted an additional hand search (last search: 26 February 2024). We included scientific articles in English or German, published from April 2021 onwards, building upon the results of a mapping review with a related research question. Two reviewers independently screened studies for eligibility; after piloting, one reviewer extracted data, checked by another. Results: Our database search yielded 8054 hits, and we identified 33 articles from our hand search. Of the 196 included reports, 159 described more traditional ML techniques and 37 focused on LLMs. LLM approaches covered 10 of 13 defined SR steps, most frequently literature search (n=15, 41%), study selection (n=14, 38%), and data extraction (n=11, 30%). The most frequently recurring LLM was GPT (n=33, 89%). Validation studies were predominant (n=21, 57%). In half of the studies, authors evaluated LLM use as promising (n=20, 54%), one quarter as neutral (n=9, 24%) and one fifth as non-promising (n=8, 22%). Conclusions: Although LLMs show promise in supporting SR creation, fully established or validated applications are often lacking. The rapid increase in research on LLMs for evidence synthesis production highlights their growing relevance.
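The percentages in the abstract can be re-derived from the reported counts. The following is a minimal sketch, assuming that every percentage uses the 37 LLM-focused reports as its denominator (an inference from the abstract, not something it states explicitly); the names `LLM_REPORTS`, `reported`, and `percent` are illustrative only.

```python
# Assumption: all percentages are relative to the 37 LLM-focused reports.
LLM_REPORTS = 37

# (count, percentage as stated in the abstract)
reported = {
    "literature search": (15, 41),
    "study selection": (14, 38),
    "data extraction": (11, 30),
    "GPT used": (33, 89),
    "validation studies": (21, 57),
    "rated promising": (20, 54),
    "rated neutral": (9, 24),
    "rated non-promising": (8, 22),
}


def percent(n, total=LLM_REPORTS):
    """Return n/total as a whole-number percentage."""
    return round(100 * n / total)


for label, (n, stated) in reported.items():
    print(f"{label}: {n}/{LLM_REPORTS} = {percent(n)}% (abstract: {stated}%)")

# The three author ratings partition the LLM reports ...
ratings_total = 20 + 9 + 8
# ... whereas the SR-step counts overlap, since one report can cover several steps.
top_steps_total = 15 + 14 + 11
```

Under this assumption, the three rating categories sum to 37, while the counts for the three most frequent SR steps sum to more than 37. The step percentages are therefore not expected to add up to 100%.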
What problem does this paper attempt to address?
The paper by Lieberum et al. (2024) maps how large language models (LLMs) are currently applied in systematic reviews (SRs) and how promising those applications appear. Specifically, the study aims to:

1. **Provide an overview**: Summarize the use of machine learning (ML), and LLMs in particular, in conducting systematic reviews in health research.
2. **Identify application areas**: Determine which steps of the systematic review process LLMs have been used to support. LLM approaches covered 10 of the 13 defined SR steps, most frequently literature search (41%), study selection (38%), and data extraction (30%).
3. **Evaluate feasibility**: Assess, from the existing literature, how well these models work in systematic reviews. Authors rated LLM use as promising in 54% of the LLM studies, as neutral in 24%, and as non-promising in 22%. Validation studies predominated (57%), yet fully established or validated applications are often lacking.
4. **Promote future research**: Use the current findings as a basis and direction for further research, especially on how to improve the application of LLMs in systematic reviews.

In summary, the paper surveys and evaluates the current state and future potential of large language models in systematic reviews, concluding that they show promise but are not yet ready for routine use.