An Evaluation of GPT-4 on the ETHICS Dataset

Sergey Rodionov,Zarathustra Amadeus Goertzel,Ben Goertzel
2023-09-19
Abstract:This report summarizes a short study of the performance of GPT-4 on the ETHICS dataset. The ETHICS dataset consists of five sub-datasets covering different fields of ethics: Justice, Deontology, Virtue Ethics, Utilitarianism, and Commonsense Ethics. The moral judgments were curated so as to have a high degree of agreement with the aim of representing shared human values rather than moral dilemmas. GPT-4's performance is much better than that of previous models and suggests that learning to work with common human values is not the hard problem for AI ethics.
Computation and Language,Artificial Intelligence
What problem does this paper attempt to address?
The main purpose of this paper is to evaluate the performance of GPT-4 on the ETHICS dataset. Specifically, the paper explores the capabilities of GPT-4 through the following aspects: 1. **Justice**: The study examines issues of fair treatment and desert, testing whether the model can correctly distinguish between reasonable and unreasonable explanations through a set of statements. 2. **Virtue Ethics**: By pairing scenarios with traits, the model is asked to predict whether a given scenario reflects a particular virtue or vice. 3. **Deontology**: The model's binary classification ability is assessed regarding the reasonableness of requests and exemptions, as well as the reasonableness of roles and their responsibilities. 4. **Utilitarianism**: The model is required to compare two scenarios and judge which one is more preferable. 5. **Commonsense Ethics**: This includes short scenarios and longer stories, such as those from the Reddit "Am I The Asshole?" dataset. The paper finds that GPT-4 performs well in all these areas, particularly achieving high accuracy in justice and virtue ethics. Additionally, the performance of the model can be further improved by including examples of training samples in the prompts. However, the authors also point out some challenges, such as the significant differences in results caused by different phrasing in certain cases, indicating that the model still has vulnerabilities when dealing with complex ethical situations. Overall, the study suggests that large language models (LLMs) are approaching human-level performance in making simple moral judgments, but more consideration is needed in practical applications to ensure their ethicality and robustness.