Abstract: The use of large language models (LLMs) in clinical medicine is currently thriving. Effectively transferring LLMs' pertinent theoretical knowledge from computer science to their application in clinical medicine is crucial. Prompt engineering has shown potential as an effective method in this regard. To explore the application of prompt engineering in LLMs and to examine the reliability of LLMs, different styles of prompts were designed and used to ask different LLMs about their agreement with the American Academy of Orthopaedic Surgeons (AAOS) osteoarthritis (OA) evidence-based guidelines. Each question was asked 5 times. We compared the consistency of the findings with guidelines across different evidence levels for different prompts and assessed the reliability of different prompts by asking the same question 5 times. gpt-4-Web with ROT prompting had the highest overall consistency (62.9%) and performed notably well on strong recommendations, with a total consistency of 77.5%. The reliability of the different LLMs for different prompts was not stable (Fleiss kappa ranged from −0.002 to 0.984). This study revealed that different prompts had variable effects across various models, and gpt-4-Web with the ROT prompt was the most consistent. An appropriate prompt could improve the accuracy of responses to professional medical questions.
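The reliability figures above are Fleiss kappa values computed over repeated answers to the same question. As a minimal sketch of that statistic (not the authors' code), the function below treats each of the 5 repetitions as one "rater" and each question as one "subject". The function name, the input layout, and the answer labels are assumptions made for illustration.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-question answer lists.

    ratings: one list per question, each holding the same number of
    categorical answers (assumed here: 5 repetitions per question).
    """
    if not ratings:
        raise ValueError("need at least one question")
    n = len(ratings[0])  # answers per question
    if n < 2 or any(len(r) != n for r in ratings):
        raise ValueError("every question needs the same number (>= 2) of answers")

    counts = [Counter(r) for r in ratings]
    categories = set().union(*counts)
    total = len(ratings) * n

    # Chance agreement: sum of squared overall category proportions.
    p_e = sum((sum(c[cat] for c in counts) / total) ** 2 for cat in categories)

    # Observed agreement: mean per-question pairwise agreement.
    p_bar = sum(
        (sum(v * v for v in c.values()) - n) / (n * (n - 1)) for c in counts
    ) / len(ratings)

    if p_e == 1.0:
        # Every answer falls in one category; kappa is undefined.
        return float("nan")
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical answers: two questions, five repetitions each.
stable = [["agree"] * 5, ["disagree"] * 5]
unstable = [["agree", "agree", "agree", "disagree", "disagree"],
            ["agree", "agree", "disagree", "disagree", "disagree"]]
print(fleiss_kappa(stable), fleiss_kappa(unstable))
```

Identical answers within each question give a kappa of 1, while answers that split within a question pull the value toward or below 0, which is how a range such as −0.002 to 0.984 can arise.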


### What problems does this paper attempt to solve?

This paper explores how **prompt engineering** can improve the accuracy and consistency of large language models (LLMs) when they answer professional medical questions in clinical medicine. The research focuses on the following points:

1. **Evaluating the effects of different prompting methods**: The researchers designed several types of prompts (IO, 0-COT, P-COT, and ROT prompts) and tested them on different LLMs to evaluate their impact on the consistency and reliability of the models' answers.
2. **Verifying the consistency between LLMs and clinical guidelines**: The researchers used the evidence-based guidelines of the American Academy of Orthopaedic Surgeons (AAOS) on osteoarthritis (OA) as the standard for judging whether LLM answers agree with the guidelines. The analysis was broken down by evidence strength: strong, moderate, limited, and consensus recommendations.
3. **Exploring the best prompting strategy**: Through repeated experiments, the researchers looked for the prompting strategy that makes LLM answers most consistent and reliable. gpt-4-Web combined with ROT prompts performed best, achieving an overall consistency of 62.9% and a consistency of 77.5% at the strong recommendation level.
4. **Discussing the factors affecting the performance of LLMs**: In addition to prompt engineering, the research explored the impact of other factors, such as model architecture, parameter settings, and fine-tuning techniques. The research indicates that adjusting internal parameters (such as temperature settings) can significantly change the performance of LLMs.
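The per-level consistency figures in point 3 can be illustrated with a short sketch. It assumes that "consistency" means the share of model answers that match the guideline's position, grouped by evidence strength; the record layout, function name, and sample data are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def consistency_by_level(records):
    """Return (per-level rates, overall rate).

    records: iterable of (evidence_level, model_answer, guideline_answer).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, answer, guideline in records:
        totals[level] += 1
        hits[level] += int(answer == guideline)

    rates = {level: hits[level] / totals[level] for level in totals}
    n_all = sum(totals.values())
    overall = sum(hits.values()) / n_all if n_all else float("nan")
    return rates, overall


# Hypothetical records, not data from the study.
sample = [
    ("strong", "agree", "agree"),
    ("strong", "disagree", "agree"),
    ("limited", "agree", "agree"),
    ("limited", "agree", "agree"),
]
rates, overall = consistency_by_level(sample)
print(rates, overall)
```

Reporting the rate per evidence level alongside the overall rate makes it visible when a prompt does well on strong recommendations but poorly on weaker ones.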
### Research background

As large language models have become widely used for natural language processing tasks, their application in medicine has attracted growing attention. Their current performance in the medical field is still limited, however, especially in complex case diagnosis and in assessing agreement with guidelines. The researchers therefore aim to use prompt engineering to optimize the application of LLMs in medicine and to improve the accuracy and consistency of their answers to medical questions.

### Main findings

- **The effects of different prompting methods differ significantly**: The ROT prompt performs best on gpt-4-Web, while other combinations of model and prompt give varying results.
- **The answer consistency of LLMs is unstable**: Even the same model may generate different answers under different prompts, indicating that self-consistency is an important evaluation index.
- **Temperature settings affect model performance**: For example, gpt-3.5-API-0 and gpt-3.5-ft-0 show perfect reliability at a temperature of 0 but perform poorly at other temperature settings.

### Future research directions

The researchers suggest that future work should further optimize prompt engineering so that it fits different clinical scenarios more closely, and should develop prompt guidelines written specifically for patients and for doctors. More methods also need to be explored to improve the effectiveness and reliability of LLMs in medical settings, including combinations of model development, parameter adjustment, and fine-tuning techniques.

In conclusion, this study shows the potential of prompt engineering for improving the accuracy of LLMs in answering medical questions and provides a useful reference for future research.
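The repeated-question design behind these findings can be sketched as a small harness: every question is asked several times under every prompt style, and the answers are kept for later consistency and reliability analysis. Everything here is an assumption for illustration: `collect_answers`, `stub_model`, and the two templates are hypothetical, the "step by step" wording is a commonly used zero-shot chain-of-thought trigger rather than the study's prompt text, and a real run would replace the stub with a call to an actual model.

```python
def collect_answers(ask, questions, prompt_styles, repeats=5):
    """Ask each question `repeats` times under each prompt style.

    ask: callable taking a prompt string and returning an answer string.
    prompt_styles: mapping of style name to a template with a {question} slot.
    Returns a dict keyed by (style, question) with the list of answers.
    """
    answers = {}
    for style, template in prompt_styles.items():
        for question in questions:
            prompt = template.format(question=question)
            answers[(style, question)] = [ask(prompt) for _ in range(repeats)]
    return answers


def stub_model(prompt):
    # Deterministic stand-in for an LLM, used only to make the sketch runnable.
    return "agree" if "step by step" in prompt else "disagree"


# Illustrative templates only; the study's actual prompt texts are not reproduced here.
styles = {
    "IO": "{question}",
    "0-COT": "{question}\nLet's think step by step.",
}
results = collect_answers(stub_model, ["Is treatment X recommended for knee OA?"], styles)
for key, values in results.items():
    print(key, values)
```

Keeping all repetitions per (style, question) pair, rather than only a majority answer, is what allows both the guideline consistency and the repeat-to-repeat reliability to be computed from the same run.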

Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs
