Prompt engineering paradigms for medical applications: scoping review and recommendations for better practices

Jamil Zaghir, Marco Naguib, Mina Bjelogrlic, Aurélie Névéol, Xavier Tannier, Christian Lovis
2024-05-02
Abstract: Prompt engineering is crucial for harnessing the potential of large language models (LLMs), especially in the medical domain where specialized terminology and phrasing are used. However, the efficacy of prompt engineering in the medical domain remains to be explored. In this work, 114 recent studies (2022-2024) applying prompt engineering in medicine, covering prompt learning (PL), prompt tuning (PT), and prompt design (PD), are reviewed. PD is the most prevalent (78 articles). In 12 papers, the terms PD, PL, and PT were used interchangeably. ChatGPT is the most commonly used LLM, with seven papers using it for processing sensitive clinical data. Chain-of-Thought emerges as the most common prompt engineering technique. While PL and PT articles typically provide a baseline for evaluating prompt-based approaches, 64% of PD studies lack non-prompt-related baselines. We provide tables and figures summarizing existing work, and reporting recommendations to guide future research contributions.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper reviews how prompt engineering is applied in the medical field. It covers 114 studies published between 2022 and 2024 and groups them into three paradigms: prompt design (PD), prompt learning (PL), and prompt tuning (PT). Prompt design is the most common, appearing in 78 articles. PL and PT studies typically report non-prompt-related baselines against which their prompt-based approaches can be evaluated, whereas 64% of PD studies lack such baselines. The paper also discusses the effectiveness of individual prompt engineering techniques; Chain-of-Thought (CoT) is among the most commonly used and performs well on tasks such as multiple-choice question answering. It further highlights inconsistent terminology in current research and analyzes trends in the choice of large language models (LLMs), notably the widespread use of ChatGPT in prompt design. Finally, it identifies future research directions, including improving evaluation methods, increasing the use of local LLMs, and further validating the effectiveness of prompt engineering. Together, these analyses are intended to guide researchers and practitioners in medical natural language processing.
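To make the Chain-of-Thought idea concrete for a multiple-choice task, the following is a minimal sketch that builds two prompt variants for the same question: a direct-answer prompt and a zero-shot CoT prompt. The function name `build_mcq_prompt`, the prompt layout, the trigger phrase, and the sample question are illustrative assumptions, not wording prescribed by the reviewed paper. The sketch only constructs prompt strings and calls no LLM.

```python
from typing import Sequence

# Assumption: a commonly used zero-shot Chain-of-Thought trigger phrase.
# The reviewed paper does not prescribe this exact wording.
COT_TRIGGER = "Let's think step by step."

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def build_mcq_prompt(
    question: str,
    options: Sequence[str],
    chain_of_thought: bool = False,
) -> str:
    """Build a multiple-choice prompt (hypothetical helper for illustration).

    With chain_of_thought=False the model is asked for a letter only;
    with chain_of_thought=True the prompt ends with the CoT trigger so the
    model is nudged to reason before answering.
    """
    if not options:
        raise ValueError("at least one option is required")
    if len(options) > len(LETTERS):
        raise ValueError("too many options to label with single letters")

    lines = [f"Question: {question}"]
    for letter, option in zip(LETTERS, options):
        lines.append(f"{letter}. {option}")

    if chain_of_thought:
        lines.append(f"Answer: {COT_TRIGGER}")
    else:
        lines.append("Answer with a single letter.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative question, not taken from any reviewed study.
    question = "Which vitamin deficiency causes scurvy?"
    options = ["Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"]

    print(build_mcq_prompt(question, options))
    print()
    print(build_mcq_prompt(question, options, chain_of_thought=True))
```

Keeping both variants behind one function makes it easy to run them on the same question set. Note that the direct-answer variant is still a prompt-based condition, so it does not by itself supply the non-prompt-related baseline that the review finds missing in most PD studies.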