How good are large language models for automated data extraction from randomized trials?

Zhuanlan Sun, Ruilin Zhang, Suhail A. Doi, Luis Furuya-Kanamori, Tianqi Yu, Lifeng Lin, Chang Xu
DOI: https://doi.org/10.1101/2024.02.20.24303083
2024-02-21
Abstract: In evidence synthesis, data extraction is a crucial procedure, but it is time intensive and prone to human error. The rise of large language models (LLMs) in the field of artificial intelligence (AI) offers a solution to these problems through automation. In this case study, we evaluated the performance of two prominent LLM-based AI tools for use in automated data extraction. Randomized trials from two systematic reviews were used as part of the case study. Prompts related to each data extraction task (e.g., extract event counts of control group) were formulated separately for binary and continuous outcomes. The percentage of correct responses (\(P_{\text{corr}}\)) was tested in 39 randomized controlled trials reporting 10 binary outcomes and 49 randomized controlled trials reporting one continuous outcome. The \(P_{\text{corr}}\) and agreement across three runs for data extracted by two AI tools were compared with well-verified metadata. For the extraction of binary events in the treatment group across 10 outcomes, the \(P_{\text{corr}}\) ranged from 40% to 87% and from 46% to 97% for ChatPDF and for Claude, respectively. For continuous outcomes, the \(P_{\text{corr}}\) ranged from 33% to 39% across six tasks (Claude only). The agreement of the response between the three runs of each task was generally good, with Cohen’s kappa statistic ranging from 0.78 to 0.96 and from 0.65 to 0.82 for ChatPDF and Claude, respectively. Our results highlight the potential of ChatPDF and Claude for automated data extraction. Whilst promising, the percentage of correct responses is still unsatisfactory and therefore substantial improvements are needed for current AI tools to be adopted in research practice.
Health Informatics
What problem does this paper attempt to address?
This paper aims to explore the performance of large language models (LLMs) in automating data extraction from randomized controlled trials (RCTs). Specifically, the researchers evaluated two LLM-based artificial intelligence tools, ChatPDF and Claude, on automated data extraction tasks, focusing on the accuracy and consistency of binary-outcome and continuous-outcome data extraction.

### Research Background

In evidence synthesis, data extraction is a crucial step, but it is time-consuming and prone to human error. With the rise of large language models in the field of artificial intelligence, it has become possible to address these problems through automation. The researchers therefore selected RCTs from two systematic reviews as case studies to evaluate the performance of ChatPDF and Claude on such tasks.

### Main Research Questions

- **Binary-outcome data extraction**: Evaluate the accuracy and consistency of AI tools in extracting event counts (such as the number of deaths) in the treatment group and the control group.
- **Continuous-outcome data extraction**: Evaluate the performance of AI tools in extracting continuous-outcome data such as means, standard deviations, and group sizes.

### Methods

The researchers designed a series of prompts for the different data extraction tasks and carried out three independent runs on ChatPDF and Claude to evaluate their consistency and reliability. The primary outcome measure was the percentage of correct extractions (\(P_{\text{corr}}\)) for each task, and the secondary outcome measures were the percentage of incorrect extractions (\(P_{\text{icorr}}\)) and the proportion of unrecognized information (\(P_{\text{fail}}\)).

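The outcome measures described in the Methods can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the data are invented, a missing response is represented as `None`, a response counts as correct only on exact match with the reference value, and agreement between runs is computed as pairwise Cohen's kappa on the correct/incorrect/fail labels. The summary does not state how the authors scored responses or combined the three runs.

```python
from collections import Counter
from itertools import combinations


def classify(extracted, reference):
    """Label one response: 'fail' if nothing was returned, else 'correct' or 'incorrect'."""
    if extracted is None:
        return "fail"
    return "correct" if extracted == reference else "incorrect"


def percentages(labels):
    """Percentage of correct, incorrect and failed extractions for one run."""
    counts = Counter(labels)
    n = len(labels)
    return {k: 100.0 * counts[k] / n for k in ("correct", "incorrect", "fail")}


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from the marginal label frequencies of each run.
    p_exp = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if p_exp == 1.0:  # both runs used a single identical label throughout
        return 1.0
    return (p_obs - p_exp) / (1.0 - p_exp)


# Hypothetical data: reference event counts for four trials and three runs of one tool.
reference = [12, 30, 7, 45]
runs = [
    [12, 30, None, 44],
    [12, 30, 7, 44],
    [12, 31, None, 44],
]

labels = [[classify(e, r) for e, r in zip(run, reference)] for run in runs]
p_run1 = percentages(labels[0])  # keys map to P_corr, P_icorr, P_fail for run 1
kappas = [cohen_kappa(x, y) for x, y in combinations(labels, 2)]
```

Keeping "fail" separate from "incorrect" mirrors the distinction between \(P_{\text{icorr}}\) and \(P_{\text{fail}}\): a tool that declines to answer is treated differently from one that returns a wrong number.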
### Results

- **Binary-outcome**:
  - For the extraction of group-size information in the treatment group and the control group, the \(P_{\text{corr}}\) of ChatPDF was 54% and 59% respectively, and that of Claude was 72% and 77% respectively.
  - For the extraction of event counts in the treatment group and the control group, the weighted-average \(P_{\text{corr}}\) of ChatPDF was 64% and 64% respectively, and that of Claude was 70% and 75% respectively.
- **Continuous-outcome**:
  - When Claude was used for continuous-outcome data extraction, the \(P_{\text{corr}}\) of each task ranged from 33% to 39%, showing poor performance.

### Discussion

- **Binary-outcome**: ChatPDF and Claude performed well in binary-outcome data extraction and were competitive compared with humans.
- **Continuous-outcome**: The AI tools performed poorly in continuous-outcome data extraction, mainly because continuous outcomes are reported in more complex ways and lack locating keywords or phrases.

### Conclusion

Although current large language models perform well in binary-outcome data extraction tasks, there is still room for improvement in continuous-outcome data extraction. The study suggests combining AI tools with manual extraction to improve the efficiency and accuracy of automated data extraction. Future research should further optimize prompt design and explore more applications of AI tools.
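The weighted-average \(P_{\text{corr}}\) reported in the Results can be sketched as follows. This is only an illustration: it assumes each outcome is weighted by the number of trials reporting it, which the summary does not confirm, and the numbers are invented rather than taken from the study.

```python
def weighted_p_corr(per_outcome):
    """Weighted mean of per-outcome P_corr values.

    per_outcome: list of (n_trials, p_corr_percent) pairs, one per outcome.
    The weights are trial counts, which is an assumption of this sketch.
    """
    total = sum(n for n, _ in per_outcome)
    if total == 0:
        raise ValueError("at least one trial is required")
    return sum(n * p for n, p in per_outcome) / total


# Hypothetical example: two outcomes reported by 10 and 30 trials.
example = [(10, 40.0), (30, 80.0)]
result = weighted_p_corr(example)  # (10*40 + 30*80) / 40
```

Weighting by trial count keeps an outcome reported by only a few trials from dominating the summary figure, which an unweighted mean of the per-outcome percentages would allow.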