GPT for RCTs? Using AI to measure adherence to reporting guidelines

James G Wrightson, Paul Blazey, David Moher, Karim M Khan, Clare L Ardern
DOI: https://doi.org/10.1101/2023.12.14.23299971
2024-05-14
Abstract: Background: Adherence to established reporting guidelines can improve clinical trial reporting standards, but attempts to improve adherence have produced mixed results. This exploratory study aimed to determine how accurately a Large Language Model generative AI system (AI-LLM) could determine reporting guideline compliance in a sample of sports medicine clinical trial reports.
Design and Methods: This study was an exploratory retrospective data analysis. The OpenAI GPT-4 and Meta Llama2 AI-LLMs were evaluated for their ability to determine reporting guideline adherence in a sample of 113 published sports medicine and exercise science clinical trial reports. For each paper, the GPT-4-Turbo and Llama 2 70B models were prompted to answer a series of nine reporting guideline questions about the text of the article. The GPT-4-Vision model was prompted to answer two additional reporting guideline questions about the participant flow diagram in a subset of articles. The dataset was randomly split (80/20) into TRAIN and TEST datasets. Hyperparameter tuning and fine-tuning were performed using the TRAIN dataset. The Llama2 model was fine-tuned using the data from the GPT-4-Turbo analysis of the TRAIN dataset.
Primary outcome measure: Model performance (F1-score, classification accuracy) was assessed using the TEST dataset.
Results: Across all questions about the article text, the GPT-4-Turbo AI-LLM demonstrated acceptable performance (F1-score = 0.89, accuracy [95% CI] = 90% [85-94%]). Accuracy for all reporting guidelines was > 80%. The Llama2 model accuracy was initially poor (F1-score = 0.63, accuracy [95% CI] = 64% [57-71%]) and improved with fine-tuning (F1-score = 0.84, accuracy [95% CI] = 83% [77-88%]). The GPT-4-Vision model accurately identified all participant flow diagrams (accuracy [95% CI] = 100% [89-100%]) but was less accurate at identifying when details were missing from the flow diagram (accuracy [95% CI] = 57% [39-73%]).
Conclusions: Both the GPT-4 and fine-tuned Llama2 AI-LLMs showed promise as tools for assessing reporting guideline compliance. Next steps should include developing an efficient, open-source AI-LLM and exploring methods to improve model accuracy.
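To make the outcome measures in the abstract concrete, the following is a minimal Python sketch of a random 80/20 split and of computing an F1-score and a classification accuracy with a 95% confidence interval. It is illustrative only and is not the authors' code: the function names are hypothetical, the labels are treated as binary (guideline item reported or not), and the Wilson score interval is an assumption, since the abstract does not state which F1 averaging or which interval method was used.

```python
import math
import random


def split_train_test(items, test_fraction=0.2, seed=0):
    """Randomly split items into (train, test); 80/20 by default."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def f1_and_accuracy(y_true, y_pred):
    """Binary F1-score and accuracy for paired true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs)
    return f1, accuracy


def wilson_interval(n_correct, n_total, z=1.96):
    """Wilson score interval for a proportion (assumed method, z=1.96 for 95%)."""
    p = n_correct / n_total
    scale = 1 + z * z / n_total
    center = (p + z * z / (2 * n_total)) / scale
    half = z * math.sqrt(p * (1 - p) / n_total
                         + z * z / (4 * n_total ** 2)) / scale
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical article identifiers standing in for the 113 trial reports.
    train, test = split_train_test(range(113))
    print(len(train), len(test))

    # Made-up labels, purely to exercise the functions.
    y_true = [1, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 0, 1]
    f1, acc = f1_and_accuracy(y_true, y_pred)
    n_correct = sum(t == p for t, p in zip(y_true, y_pred))
    print(f1, acc, wilson_interval(n_correct, len(y_true)))
```

An interval method such as Wilson stays informative when observed accuracy is 100%, which is why a result like the flow diagram identification can still be reported with a lower bound below 100%.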
What problem does this paper attempt to address?
This paper evaluates how accurately large language models (AI-LLMs) can determine whether clinical trial reports comply with reporting guidelines. Specifically, the study tests two AI models, OpenAI's GPT-4 and Meta's Llama2, on clinical trial reports from sports medicine and exercise science. The main objectives of the study are:

1. **Evaluate model performance**: Test how accurately and reliably the GPT-4 and Llama2 models identify whether specific reporting guideline items are met in clinical trial reports.
2. **Explore improvement methods**: Examine whether fine-tuning improves the performance of the Llama2 model, and consider the possibility of developing an efficient, open-source AI-LLM.
3. **Improve report quality**: Ultimately, help journal editors, publishers, peer reviewers, and authors check compliance with reporting guidelines more quickly and accurately, thereby improving the overall quality and transparency of clinical trial reports.

The research background notes that poor-quality clinical trial reporting is a common problem that not only undermines the reliability and credibility of research but may also harm patient care. Improving reporting standards is therefore an ethical necessity. Using AI to assist this process could save considerable time and resources while making the editorial process more efficient.