Prompt engineering in consistency and reliability with the evidence-based guideline for LLMs

Li Wang, Xi Chen, XiangWen Deng, Hao Wen, MingKe You, WeiZhi Liu, Qi Li, Jian Li
DOI: https://doi.org/10.1038/s41746-024-01029-4
IF: 15.2
2024-02-21
npj Digital Medicine
Abstract: The use of large language models (LLMs) in clinical medicine is currently thriving. Effectively transferring LLMs' pertinent theoretical knowledge from computer science to their application in clinical medicine is crucial. Prompt engineering has shown potential as an effective method in this regard. To explore the application of prompt engineering in LLMs and to examine the reliability of LLMs, different styles of prompts were designed and used to ask different LLMs about their agreement with the American Academy of Orthopaedic Surgeons (AAOS) osteoarthritis (OA) evidence-based guidelines. Each question was asked 5 times. We compared the consistency of the findings with guidelines across different evidence levels for different prompts and assessed the reliability of different prompts by asking the same question 5 times. gpt-4-Web with ROT prompting had the highest overall consistency (62.9%) and a significant performance for strong recommendations, with a total consistency of 77.5%. The reliability of the different LLMs for different prompts was not stable (Fleiss kappa ranged from −0.002 to 0.984). This study revealed that different prompts had variable effects across various models, and the gpt-4-Web with ROT prompt was the most consistent. An appropriate prompt could improve the accuracy of responses to professional medical questions.
health care sciences & services, medical informatics
### What problems does this paper attempt to solve?

This paper explores how **prompt engineering** can improve the accuracy and consistency of large language models (LLMs) when they answer professional medical questions in clinical medicine. The research focuses on the following points:

1. **Evaluating the effects of different prompting methods**: The researchers designed several types of prompts (IO, 0-COT, P-COT, and ROT prompts) and tested them on different LLMs to evaluate their impact on the consistency and reliability of the models' answers.
2. **Verifying the consistency between LLMs and clinical guidelines**: The researchers used the evidence-based osteoarthritis (OA) guidelines of the American Academy of Orthopaedic Surgeons (AAOS) as the standard for judging whether LLM answers are in line with the guidelines. The analysis was broken down by evidence strength: strong, moderate, limited, and consensus recommendations.
3. **Exploring the best prompting strategy**: Through repeated experiments, the researchers tried to identify which prompting strategy makes LLM answers most consistent and reliable. gpt-4-Web combined with ROT prompts performed best, achieving an overall consistency of 62.9% and a consistency of 77.5% for strong recommendations.
4. **Discussing the factors affecting the performance of LLMs**: In addition to prompt engineering, the research also explored the impact of other factors, such as model architecture, parameter settings, and fine-tuning techniques. It indicates that adjusting internal parameters, such as the temperature setting, can substantially change the performance of LLMs.
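The consistency percentages quoted above are, in essence, the share of model answers that match the guideline's recommendation, broken down by evidence level. The following is a minimal sketch of that bookkeeping, not the authors' code. It assumes each model response has already been coded into a rating that can be compared directly with the guideline's rating; the function name `consistency_by_level`, the record layout, and the sample records are all hypothetical.

```python
from collections import defaultdict


def consistency_by_level(records):
    """Return the fraction of answers matching the guideline, per evidence level.

    `records` is an iterable of (evidence_level, guideline_rating, model_rating)
    tuples. This layout is an assumption made for illustration only.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, expected, answered in records:
        totals[level] += 1
        hits[level] += int(expected == answered)
    if not totals:
        raise ValueError("no records supplied")
    rates = {level: hits[level] / totals[level] for level in totals}
    # Overall consistency pools every answer, regardless of evidence level.
    rates["overall"] = sum(hits.values()) / sum(totals.values())
    return rates


# Hypothetical coded answers: (evidence level, guideline rating, model rating).
sample = [
    ("strong", "strong", "strong"),
    ("strong", "strong", "moderate"),
    ("moderate", "moderate", "moderate"),
    ("limited", "limited", "consensus"),
]
for level, rate in consistency_by_level(sample).items():
    print(f"{level}: {rate:.1%}")
```

Reporting the per-level rates next to the pooled rate mirrors how the summary presents both an overall figure and a figure for strong recommendations.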
### Research background

With the wide application of large language models in natural language processing tasks, their use in the medical field has gradually attracted attention. However, the current performance of LLMs in medicine still has limitations, especially in complex case diagnosis and in assessing consistency with guidelines. The researchers therefore hope to optimize the application of LLMs in medicine through prompt engineering and to improve the accuracy and consistency of their answers to medical questions.

### Main findings

- **Significant differences in the effects of different prompting methods**: The ROT prompt performs best on gpt-4-Web, while other model and prompt combinations give different results.
- **The answer consistency of LLMs is unstable**: Even for the same model, answers can differ across prompts and across repeated runs of the same question (Fleiss kappa ranged from −0.002 to 0.984), so self-consistency is an important evaluation index.
- **Temperature settings affect model performance**: For example, gpt-3.5-API-0 and gpt-3.5-ft-0 show perfect reliability at a temperature of 0, but perform poorly at other temperature settings.

### Future research directions

The researchers suggest that future work should further optimize prompt engineering, tie it more closely to different clinical scenarios, and develop prompt guidelines specifically for patients and doctors. More methods also need to be explored to improve the effectiveness and reliability of LLMs in the medical environment, including combinations of model development, parameter adjustment, and fine-tuning techniques.

In conclusion, this study reveals the potential of prompt engineering for improving the accuracy of LLMs in answering medical questions and provides a valuable reference for future research.
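The reliability figures quoted in the abstract are Fleiss kappa values, which measure how well repeated answers to the same question agree beyond chance. The sketch below implements the standard Fleiss kappa formula in plain Python, treating the 5 repeated runs of each question as the "raters". It is an illustration under stated assumptions, not the study's analysis code: the category labels, the sample answers, and the helper names `to_counts` and `fleiss_kappa` are hypothetical.

```python
import math
from collections import Counter


def to_counts(answers, categories):
    """Convert per-question answer lists into a question-by-category count table."""
    table = []
    for row in answers:
        tally = Counter(row)
        table.append([tally[c] for c in categories])
    return table


def fleiss_kappa(counts):
    """Fleiss' kappa, where counts[i][j] is the number of runs that gave
    category j for question i. Every row must sum to the same n >= 2."""
    if not counts:
        raise ValueError("empty count table")
    n_items = len(counts)
    n_runs = sum(counts[0])
    if n_runs < 2 or any(sum(row) != n_runs for row in counts):
        raise ValueError("each question needs the same number (>= 2) of runs")
    # Observed agreement: proportion of agreeing run pairs, averaged over questions.
    p_i = [(sum(c * c for c in row) - n_runs) / (n_runs * (n_runs - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the overall share of each category.
    n_cats = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (n_items * n_runs)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    if math.isclose(p_e, 1.0):
        # Every answer fell in one category, so kappa is undefined (0/0).
        return math.nan
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: 3 questions, each asked 5 times.
CATEGORIES = ["strong", "moderate", "limited", "consensus"]
answers = [
    ["strong"] * 5,
    ["strong", "strong", "moderate", "strong", "moderate"],
    ["limited", "consensus", "limited", "limited", "moderate"],
]
kappa = fleiss_kappa(to_counts(answers, CATEGORIES))
print(f"Fleiss' kappa: {kappa:.3f}")
```

Note the edge case: if a model gives the identical answer to every question on every run, chance agreement equals 1 and kappa is undefined, so this sketch returns NaN instead of a number.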