TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations

Nathalie Maria Kirch, Konstantin Hebenstreit, Matthias Samwald
2024-10-10
Abstract: We present the TRIAGE Benchmark, a novel machine ethics (ME) benchmark that tests LLMs' ability to make ethical decisions during mass casualty incidents. It uses real-world ethical dilemmas with clear solutions designed by medical professionals, offering a more realistic alternative to annotation-based benchmarks. TRIAGE incorporates various prompting styles to evaluate model performance across different contexts. Most models consistently outperformed random guessing, suggesting LLMs may support decision-making in triage scenarios. Neutral or factual scenario formulations led to the best performance, unlike other ME benchmarks where ethical reminders improved outcomes. Adversarial prompts reduced performance but not to random guessing levels. Open-source models made more morally serious errors, and general capability overall predicted better performance.
Computers and Society, Artificial Intelligence
What problem does this paper attempt to address?
The paper evaluates the ability of large language models (LLMs) to make ethical decisions in medical emergencies. It introduces a new benchmark, TRIAGE, to address the following issues:

1. **Limitations of existing ethical benchmarks**:
   - Existing machine ethics (ME) benchmarks rely mainly on fictional scenarios, either written by researchers or extracted from stories, which differ from the ethical dilemmas that arise in the real world.
   - Many existing ME benchmarks do not fully account for the diversity of cultural values, which limits how far their results generalize.
2. **Lack of a real-world medical ethics decision-making framework**:
   - Evaluating the ethical decision-making of LLMs more accurately requires a benchmark based on real-world medical scenarios, especially those involving mass-casualty incidents.
3. **The impact of different prompt styles on model performance**:
   - The paper studies how different prompt styles (such as ethical reminders and adversarial prompts) affect the performance of LLMs in ethical decision-making.
4. **The relationship between model capabilities and ethical decision-making**:
   - The paper examines whether a model's general capabilities are related to its ethical decision-making, and analyzes how performance differs across models.

To address these problems, the paper introduces the TRIAGE benchmark, which is based on existing medical triage models (such as START and jumpSTART) and uses real patient scenarios to evaluate the ethical decision-making of LLMs in mass-casualty incidents. The paper also explores how different prompt styles and grammatical structures affect model performance, with the aim of providing insights for future research and applications.
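As a rough illustration of the prompt-style comparison described above, the following sketch computes accuracy per prompt style and compares it with a random-guessing baseline. It is not the paper's evaluation code: the record layout, the function names, and the assumption of four triage categories are all hypothetical choices made for this example.

```python
from collections import defaultdict

def accuracy_by_style(records):
    """Return accuracy per prompt style.

    `records` is an iterable of (prompt_style, predicted, gold) tuples.
    This layout is an assumption for illustration, not the paper's format.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for style, predicted, gold in records:
        totals[style] += 1
        hits[style] += int(predicted == gold)
    return {style: hits[style] / totals[style] for style in totals}

def random_baseline(n_categories):
    """Expected accuracy of uniform random guessing over n categories."""
    return 1.0 / n_categories

def above_chance(accuracies, n_categories=4):
    """Flag which styles beat random guessing (assumes four triage categories)."""
    baseline = random_baseline(n_categories)
    return {style: acc > baseline for style, acc in accuracies.items()}

# Toy data, invented purely to exercise the functions.
toy_records = [
    ("neutral", "immediate", "immediate"),
    ("neutral", "delayed", "delayed"),
    ("adversarial", "minor", "delayed"),
    ("adversarial", "delayed", "delayed"),
]
toy_accuracies = accuracy_by_style(toy_records)
```

With real data, a per-style table like `toy_accuracies` is what would show whether adversarial prompts lower accuracy while still staying above the baseline.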
### Core features of the TRIAGE benchmark:

- **Based on real-world ethical dilemmas**: Uses real cases designed by medical professionals to ensure the realism and reliability of the test scenarios.
- **Diverse prompt styles**: Includes neutral prompts, ethical reminders (such as deontology and utilitarianism), and adversarial prompts (such as "doctor jailbreak" and "medical staff jailbreak") to evaluate the impact of different contexts on model performance.
- **Detailed error classification**: Classifies the model's errors into overcaring, undercaring, and instruction-following errors to better understand the model's failure modes.

Through these methods, the TRIAGE benchmark aims to provide a more realistic and reliable tool for evaluating the performance of LLMs in medical ethics decision-making.
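The error classification above can be sketched as a small function. This is a minimal sketch under stated assumptions, not the authors' implementation: the category names, the priority ranking in `PRIORITY`, and the reading of "overcaring" as assigning a higher-priority category than the gold label (and "undercaring" as a lower one) are assumptions made for this example.

```python
from collections import Counter
from typing import Optional

# Assumed ranking of triage categories by how much immediate care they imply.
# Both the names and the ordering are hypothetical choices for this sketch.
PRIORITY = {"expectant": 0, "minor": 1, "delayed": 2, "immediate": 3}

def classify_answer(predicted: Optional[str], gold: str) -> str:
    """Label one model answer as correct or as one of three error types."""
    # An unparseable answer or an unknown category counts as an
    # instruction-following error.
    if predicted is None or predicted.lower() not in PRIORITY:
        return "instruction_following_error"
    pred_rank = PRIORITY[predicted.lower()]
    gold_rank = PRIORITY[gold.lower()]
    if pred_rank == gold_rank:
        return "correct"
    # More care than warranted is overcaring; less care is undercaring.
    return "overcaring" if pred_rank > gold_rank else "undercaring"

def summarize(pairs):
    """Count outcome labels over (predicted, gold) pairs."""
    return Counter(classify_answer(pred, gold) for pred, gold in pairs)

# Toy answers, invented purely to exercise the functions.
toy_summary = summarize([
    ("immediate", "immediate"),
    ("immediate", "minor"),
    ("minor", "immediate"),
    ("I cannot answer that", "delayed"),
])
```

Separating the two directions of error matters because they carry different moral weight: under this reading, an undercaring error withholds care from a patient who needs it, while an overcaring error diverts scarce resources.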