Abstract:

Background: Large language models (LLMs) have substantially advanced natural language processing (NLP) and offer considerable potential for facilitating medical literature review. However, the accuracy and stability of LLMs in extracting complex medical information, and the prompt strategies best suited to that task, have not been adequately investigated. Our study assessed the capabilities of GPT-3.5 and GPT-4.0 in extracting or summarizing seven crucial medical information items from the titles and abstracts of research papers. We also assessed the impact of prompt engineering strategies and the effectiveness of the evaluation metrics.

Methodology: We used stratified sampling to select 100 papers, published between 2015 and 2023, from the teaching schools and departments of the LKS Faculty of Medicine, University of Hong Kong. GPT-3.5 and GPT-4.0 were instructed to extract seven pieces of information: study design, sample size, data source, patient, intervention, comparison, and outcomes. The experiment incorporated three prompt engineering strategies: persona, chain-of-thought, and few-shot prompting. We employed three metrics to assess the alignment between the GPT output and the ground truth: BERTScore, ROUGE-1, and a self-developed GPT-4.0 evaluator. Finally, we evaluated and compared the proportion of correct answers across GPT versions and prompt engineering strategies.

Results: GPT demonstrated robust capabilities in accurately extracting medical information from titles and abstracts. The average accuracy of GPT-4.0, when paired with the optimal prompt engineering strategy, ranged from 0.688 to 0.964 across the seven items, with sample size achieving the highest score and intervention the lowest. GPT version was a statistically significant factor in model performance, but prompt engineering strategies did not exhibit cumulative effects on model performance. Additionally, the GPT-4.0 evaluator outperformed ROUGE-1 and BERTScore in assessing the alignment of information (accuracy: GPT-4.0 evaluator, 0.9714; ROUGE-1, 0.9429; BERTScore, 0.8714).

Conclusion: Our results confirm the effectiveness of LLMs in extracting medical information, suggesting their potential as efficient tools for literature review. We recommend using an advanced LLM version to improve performance, while prompt engineering strategies should be tailored to the specific task. Additionally, LLMs show promise as an evaluation tool for assessing model performance on complex information processing.
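To make the ROUGE-1 alignment check named in the abstract concrete, the following is a minimal sketch of unigram-overlap scoring between a model output and a ground-truth answer. It is not the authors' implementation: the tokenizer, the function names (`tokenize`, `rouge1`, `is_aligned`), and the 0.5 decision threshold are all assumptions made for illustration, since the abstract does not state how scores were turned into correct/incorrect judgments.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase alphanumeric tokens; the study's tokenization is not stated.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(candidate: str, reference: str) -> dict[str, float]:
    """Unigram-overlap precision, recall, and F1 between two strings."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def is_aligned(candidate: str, reference: str, threshold: float = 0.5) -> bool:
    # Hypothetical threshold: the cutoff used in the study is not given in the abstract.
    return rouge1(candidate, reference)["f1"] >= threshold


if __name__ == "__main__":
    scores = rouge1("randomized controlled trial", "a randomized controlled trial")
    print(scores)
    print(is_aligned("cohort study", "randomized controlled trial"))
```

A lexical metric like this rewards shared words rather than shared meaning, which is one plausible reason an LLM-based evaluator could judge paraphrased answers more accurately.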

An Empirical Study on Information Extraction using Large Language Models

Large Language Models for Generative Information Extraction: A Survey

Exploring the Potential of Large Language Models in Molecular Tasks: An Insightful Evaluation with GPT‐4

Large Language Models as Data Preprocessors

Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond

Large Language Models Meet NLP: A Survey

Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts

Exploring the Latest LLMs for Leaderboard Extraction

An Evaluation of Large Language Models in Bioinformatics Research

Summary of ChatGPT/GPT-4 Research and Perspective Towards the Future of Large Language Models

The Shifted and The Overlooked: A Task-oriented Investigation of User-GPT Interactions

A Survey on Large Language Models from Concept to Implementation

Assessing the Performance of Chinese Open Source Large Language Models in Information Extraction Tasks

A Survey of GPT-3 Family Large Language Models Including ChatGPT and GPT-4

Automatic Information Extraction From Employment Tribunal Judgements Using Large Language Models

Large language models (LLMs): survey, technical frameworks, and future challenges

Harnessing the Power of Large Language Models for Empathetic Response Generation: Empirical Investigations and Improvements

Large Language Model in Medical Information Extraction from Titles and Abstracts with Prompt Engineering Strategies: A Comparative Study of GPT-3.5 and GPT-4