Abstract: In evidence synthesis, data extraction is a crucial procedure, but it is time intensive and prone to human error. The rise of large language models (LLMs) in the field of artificial intelligence (AI) offers a solution to these problems through automation. In this case study, we evaluated the performance of two prominent LLM-based AI tools for use in automated data extraction. Randomized trials from two systematic reviews were used as part of the case study. Prompts related to each data extraction task (e.g., extract event counts of control group) were formulated separately for binary and continuous outcomes. The percentage of correct responses (\(P_{\text{corr}}\)) was tested in 39 randomized controlled trials reporting 10 binary outcomes and 49 randomized controlled trials reporting one continuous outcome. The \(P_{\text{corr}}\) and agreement across three runs for data extracted by two AI tools were compared with well-verified metadata. For the extraction of binary events in the treatment group across 10 outcomes, the \(P_{\text{corr}}\) ranged from 40% to 87% for ChatPDF and from 46% to 97% for Claude. For continuous outcomes, the \(P_{\text{corr}}\) ranged from 33% to 39% across six tasks (Claude only). The agreement of the response between the three runs of each task was generally good, with Cohen's kappa statistic ranging from 0.78 to 0.96 for ChatPDF and from 0.65 to 0.82 for Claude. Our results highlight the potential of ChatPDF and Claude for automated data extraction. Whilst promising, the percentage of correct responses is still unsatisfactory and therefore substantial improvements are needed for current AI tools to be adopted in research practice.
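The abstract reports agreement between runs as Cohen's kappa. As a rough illustration of that statistic, the sketch below computes kappa for one pair of runs whose responses have been labelled by category. The label names, the `cohen_kappa` helper, and the pairwise comparison are assumptions made for illustration; the abstract does not say how the three runs were combined.

```python
from collections import Counter


def cohen_kappa(run_a, run_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and of equal length")
    n = len(run_a)
    # Observed agreement: share of items on which both runs give the same label.
    observed = sum(a == b for a, b in zip(run_a, run_b)) / n
    # Chance agreement: product of the marginal label frequencies, summed over labels.
    freq_a, freq_b = Counter(run_a), Counter(run_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(run_a) | set(run_b)) / n**2
    if expected == 1:
        return 1.0  # both runs use one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six trials in two runs of the same prompt.
run_1 = ["correct", "correct", "incorrect", "fail", "correct", "incorrect"]
run_2 = ["correct", "correct", "incorrect", "correct", "correct", "incorrect"]
kappa = cohen_kappa(run_1, run_2)
print(round(kappa, 2))
```

With three runs, one option under this assumption would be to compute kappa for each pair of runs and report the range or the mean.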

What problem does this paper attempt to address?

This paper aims to explore the performance of large language models (LLMs) in automating data extraction from randomized controlled trials (RCTs). Specifically, the researchers evaluated the performance of two LLM-based artificial intelligence tools, ChatPDF and Claude, in automated data extraction tasks, especially in terms of the accuracy and consistency of binary-outcome and continuous-outcome data extraction.

### Research Background

In the process of evidence synthesis, data extraction is a crucial step, but this process is time-consuming and prone to human error. With the rise of large language models in the field of artificial intelligence, it has become possible to solve these problems by automated means. Therefore, the researchers selected RCTs from two systematic reviews as case-study objects to evaluate the performance of ChatPDF and Claude in such tasks.

### Main Research Questions

- **Binary-outcome data extraction**: Evaluate the accuracy and consistency of AI tools in extracting event counts (such as the number of deaths) in the treatment group and the control group.
- **Continuous-outcome data extraction**: Evaluate the performance of AI tools in extracting continuous-outcome data such as means, standard deviations, and group sizes.

### Methods

The researchers designed a series of prompts for different data extraction tasks and carried out three independent runs on ChatPDF and Claude to evaluate their consistency and reliability. The primary outcome measure was the percentage of correct extractions (\(P_{\text{corr}}\)) for each task, and the secondary outcome measures included the percentage of incorrect extractions (\(P_{\text{icorr}}\)) and the proportion of unrecognized information (\(P_{\text{fail}}\)).

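The three outcome measures can be illustrated with a minimal sketch. It assumes each response is compared with a verified reference value by exact match, and that a response of `None` means the tool did not recognize the information; both conventions, and the function names, are assumptions for illustration rather than the study's actual scoring procedure.

```python
from collections import Counter


def classify(extracted, reference):
    """Label one response as 'fail', 'correct', or 'incorrect'."""
    if extracted is None:  # assumed marker for unrecognized information
        return "fail"
    return "correct" if extracted == reference else "incorrect"


def extraction_percentages(extracted_values, reference_values):
    """Return P_corr, P_icorr, and P_fail (in percent) for one extraction task."""
    if len(extracted_values) != len(reference_values) or not reference_values:
        raise ValueError("inputs must be non-empty and of equal length")
    labels = [classify(e, r) for e, r in zip(extracted_values, reference_values)]
    counts = Counter(labels)
    n = len(labels)
    return {
        "P_corr": 100 * counts["correct"] / n,
        "P_icorr": 100 * counts["incorrect"] / n,
        "P_fail": 100 * counts["fail"] / n,
    }


# Hypothetical event counts from four trials (not data from the study).
extracted = [12, 7, None, 30]
reference = [12, 9, 5, 30]
result = extraction_percentages(extracted, reference)
print(result)
```

By construction the three percentages sum to 100 for each task, so any two of them determine the third.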
### Results

- **Binary-outcome**:
  - For the extraction of group-size information in the treatment group and the control group, the \(P_{\text{corr}}\) of ChatPDF was 54% and 59% respectively, and that of Claude was 72% and 77% respectively.
  - For the extraction of event counts in the treatment group and the control group, the weighted-average \(P_{\text{corr}}\) of ChatPDF was 64% and 64% respectively, and that of Claude was 70% and 75% respectively.
- **Continuous-outcome**:
  - When using Claude for continuous-outcome data extraction, the \(P_{\text{corr}}\) of each task ranged from 33% to 39%, showing poor performance.

### Discussion

- **Binary-outcome**: ChatPDF and Claude performed well in binary-outcome data extraction and were competitive compared with humans.
- **Continuous-outcome**: In terms of continuous-outcome data extraction, the performance of AI tools was poor, mainly because the reporting of continuous outcomes was more complex and lacked locating keywords or phrases.

### Conclusion

Although current large language models perform well in binary-outcome data extraction tasks, there is still room for improvement in continuous-outcome data extraction. The study suggests combining AI tools with manual extraction to improve the efficiency and accuracy of automated data extraction. Future research should further optimize prompt design and explore more applications of AI tools.

How good are large language models for automated data extraction from randomized trials?

Data extraction for evidence synthesis using a large language model: A proof‐of‐concept study

Performance of two large language models for data extraction in evidence synthesis

Automatically Extracting Numerical Results from Randomized Controlled Trials with Large Language Models

Jointly Extracting Interventions, Outcomes, and Findings from RCT Reports with LLMs

Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study

Collaborative Large Language Models for Automated Data Extraction in Living Systematic Reviews

Benchmarking Human-AI Collaboration for Common Evidence Appraisal Tools

Investigating Deep-Learning NLP for Automating the Extraction of Oncology Efficacy Endpoints from Scientific Literature

Inferring Which Medical Treatments Work from Reports of Clinical Trials

Enhancing Real-World Data Extraction in Clinical Research: Evaluating the Impact of the Implementation of Large Language Models in Hospital Settings

Evaluation of a prototype machine learning tool to semi-automate data extraction for systematic literature reviews

Exploring the potential of Claude 2 for risk of bias assessment: Using a large language model to assess randomized controlled trials with RoB 2

From RAGs to riches: Using large language models to write documents for clinical trials

Zero-Shot Information Extraction for Clinical Meta-Analysis using Large Language Models

A Scoping Review of Adopted Information Extraction Methods for RCTs

Using large language models for safety-related table summarization in clinical study reports

Automated information extraction for behavioural interventions: evaluation and reflections on interdisciplinary AI development

Automated tabulation of clinical trial results: A joint entity and relation extraction approach with transformer-based language representations

Artificial Intelligence to Automate Network Meta-Analyses: Four Case Studies to Evaluate the Potential Application of Large Language Models

Zero- and few-shot prompting of generative large language models provides weak assessment of risk of bias in clinical trials