Abstract:
Background: Large language models (LLMs) have substantially advanced natural language processing (NLP) and offer considerable potential for facilitating medical literature review. However, the accuracy, stability, and prompting strategies of LLMs for extracting complex medical information have not been adequately investigated. This study assessed the capabilities of GPT-3.5 and GPT-4.0 in extracting or summarizing seven key items of medical information from the titles and abstracts of research papers. We also evaluated the impact of prompt engineering strategies and the effectiveness of the evaluation metrics.
Methodology: We used stratified sampling to select 100 papers published between 2015 and 2023 from the teaching schools and departments of the LKS Faculty of Medicine, University of Hong Kong. GPT-3.5 and GPT-4.0 were instructed to extract seven items: study design, sample size, data source, patient, intervention, comparison, and outcomes. The experiment incorporated three prompt engineering strategies: persona, chain-of-thought, and few-shot prompting. We used three metrics to assess the alignment between the GPT output and the ground truth: BERTScore, ROUGE-1, and a self-developed GPT-4.0 evaluator. Finally, we compared the proportion of correct answers across GPT versions and prompt engineering strategies.
Results: GPT demonstrated robust capabilities in accurately extracting medical information from titles and abstracts. The average accuracy of GPT-4.0, when paired with the optimal prompt engineering strategy, ranged from 0.688 to 0.964 across the seven items, with sample size achieving the highest score and intervention the lowest. GPT version was a statistically significant factor in model performance, whereas prompt engineering strategies did not exhibit cumulative effects on model performance. Additionally, the GPT-4.0 evaluator outperformed ROUGE-1 and BERTScore in assessing the alignment of information (accuracy: GPT-4.0 evaluator, 0.9714; ROUGE-1, 0.9429; BERTScore, 0.8714).
Conclusion: Our results confirm the effectiveness of LLMs in extracting medical information, suggesting their potential as efficient tools for literature review. We recommend using a more advanced LLM version to improve performance, while tailoring prompt engineering strategies to the specific task. LLMs also show promise as evaluation tools for assessing model performance on complex information processing.
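To make the metric comparison more concrete, the following is a minimal Python sketch of how a ROUGE-1 score between a model output and a ground-truth answer might be computed, and how a metric's accuracy against human judgments could be measured. The abstract does not describe the tokenization, the decision threshold, or the exact scoring variant used in the study, so the tokenizer, the F1 variant, the `threshold` value, and the function names here are all illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def metric_accuracy(scores, human_labels, threshold=0.5):
    """Fraction of items where (score >= threshold) matches the human label.

    The threshold is a hypothetical choice, not a value from the study.
    """
    predictions = [score >= threshold for score in scores]
    agreements = sum(p == h for p, h in zip(predictions, human_labels))
    return agreements / len(human_labels)


if __name__ == "__main__":
    score = rouge1_f1("a randomized trial", "randomized controlled trial")
    print(f"ROUGE-1 F1: {score:.3f}")
```

A lexical-overlap metric of this kind rewards shared words rather than shared meaning, which is one plausible reason an LLM-based evaluator could agree more closely with human judgments on paraphrased answers.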

IIMedGPT: Promoting Large Language Model Capabilities of Medical Tasks by Efficient Human Preference Alignment

ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

Towards Evaluating and Building Versatile Large Language Models for Medicine

Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model

AlpaCare: Instruction-tuned Large Language Models for Medical Application

MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization

PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications

Improving Clinical Expertise in Large Language Models Using Electronic Medical Records

MedGo: A Chinese Medical Large Language Model

MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records

Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries

PeFoMed: Parameter Efficient Fine-tuning of Multimodal Large Language Models for Medical Imaging

Towards Democratizing Multilingual Large Language Models For Medicine Through A Two-Stage Instruction Fine-tuning Approach

DoctorGPT: A Large Language Model with Chinese Medical Question-Answering Capabilities

MMedAgent: Learning to Use Medical Tools with Multi-modal Agent

ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation

Large Language Model in Medical Information Extraction from Titles and Abstracts with Prompt Engineering Strategies: A Comparative Study of GPT-3.5 and GPT-4

Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding

CMed-GPT: Prompt Tuning for Entity-Aware Chinese Medical Dialogue Generation

MedCare: Advancing Medical LLMs through Decoupling Clinical Alignment and Knowledge Aggregation

Me LLaMA: Foundation Large Language Models for Medical Applications