Abstract:
Background: Large language models (LLMs) have substantially advanced natural language processing (NLP) and offer considerable potential for facilitating medical literature review. However, the accuracy and stability of LLMs in extracting complex medical information, and the prompt strategies best suited to that task, have not been adequately investigated. Our study assessed the capabilities of GPT-3.5 and GPT-4.0 in extracting or summarizing seven key items of medical information from the titles and abstracts of research papers. We also evaluated the impact of prompt engineering strategies and the effectiveness of the evaluation metrics.
Methodology: We used stratified sampling to select 100 papers, published between 2015 and 2023, from the teaching schools and departments of the LKS Faculty of Medicine, University of Hong Kong. GPT-3.5 and GPT-4.0 were instructed to extract seven pieces of information: study design, sample size, data source, patient, intervention, comparison, and outcomes. The experiment incorporated three prompt engineering strategies: persona, chain-of-thought, and few-shot prompting. We employed three metrics to assess the alignment between the GPT output and the ground truth: BERTScore, ROUGE-1, and a self-developed GPT-4.0 evaluator. Finally, we evaluated and compared the proportion of correct answers across GPT versions and prompt engineering strategies.
Results: GPT demonstrated robust capabilities in accurately extracting medical information from titles and abstracts. The average accuracy of GPT-4.0, when paired with the optimal prompt engineering strategy, ranged from 0.688 to 0.964 across the seven items, with sample size achieving the highest score and intervention the lowest. GPT version was a statistically significant factor in model performance, but prompt engineering strategies did not exhibit cumulative effects on model performance. Additionally, the GPT-4.0 evaluator outperformed ROUGE-1 and BERTScore in assessing the alignment of information (accuracy: GPT-4.0 evaluator 0.9714, ROUGE-1 0.9429, BERTScore 0.8714).
Conclusion: Our results confirm the effectiveness of LLMs in extracting medical information, suggesting their potential as efficient tools for literature review. We recommend using a more advanced LLM version to improve model performance, while prompt engineering strategies should be tailored to the specific task. Additionally, LLMs show promise as an evaluation tool for assessing model performance on complex information processing.
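To illustrate how a lexical metric such as ROUGE-1 can be used to judge whether an extracted item aligns with the ground truth, the following is a minimal sketch of unigram-overlap precision, recall, and F1. The tokenizer, the function names `rouge1` and `is_aligned`, and the 0.5 decision threshold are assumptions made for illustration; the abstract does not state how ROUGE-1 scores were converted into correct or incorrect judgments.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs. The study's actual
    # preprocessing is not described in the abstract.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(candidate, reference):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def is_aligned(candidate, reference, threshold=0.5):
    # Hypothetical decision rule: treat the answer as matching the ground
    # truth when ROUGE-1 F1 reaches the threshold. The 0.5 value is an
    # illustrative assumption, not a figure from the study.
    return rouge1(candidate, reference)[2] >= threshold


if __name__ == "__main__":
    model_answer = "randomized controlled trial"
    ground_truth = "a randomized controlled trial"
    print(rouge1(model_answer, ground_truth))
    print(is_aligned(model_answer, ground_truth))
```

A purely lexical score of this kind cannot recognize paraphrases that share few words with the reference, which may be one reason an LLM-based evaluator agreed with the ground truth more often in the reported results.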

How Does a Generative Large Language Model Perform on Domain-Specific Information Extraction? A Comparison between GPT-4 and a Rule-Based Method on Band Gap Extraction

Accurate Prediction of Experimental Band Gaps from Large Language Model-Based Data Extraction

Exploring the Potential of Large Language Models in Molecular Tasks: An Insightful Evaluation with GPT‐4

An Empirical Study on Information Extraction using Large Language Models

Mining experimental data from Materials Science literature with Large Language Models: an evaluation study

Large Language Models for Generative Information Extraction: A Survey

Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT

Comparative Study of Large Language Model Architectures on Frontier

Large Language Model in Medical Information Extraction from Titles and Abstracts with Prompt Engineering Strategies: A Comparative Study of GPT-3.5 and GPT-4

The Impact of Large Language Models on Scientific Discovery: a Preliminary Study using GPT-4

Flexible, Model-Agnostic Method for Materials Data Extraction from Text Using General Purpose Language Models

What can Large Language Models do in chemistry? A comprehensive benchmark on eight tasks

Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Datasets

Evaluating Large Language Models on Graphs: Performance Insights and Comparative Analysis

Exploring Boundary of GPT-4V on Marine Analysis: A Preliminary Case Study

Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model

GPT4Graph: Can Large Language Models Understand Graph Structured Data? An Empirical Evaluation and Benchmarking

The Promise and Peril of Generative AI: Evidence from GPT-4 as Sell-Side Analysts

Exploring the Potential of Large Language Models in Graph Generation