Abstract: In the rapidly evolving field of artificial intelligence, large language models (LLMs) have emerged as powerful tools for a myriad of applications, from natural language processing to decision-making support systems. However, as these models become increasingly integrated into societal frameworks, the imperative to ensure they operate within ethical and moral boundaries has never been more critical. This paper introduces a novel benchmark designed to measure and compare the moral reasoning capabilities of LLMs. We present the first comprehensive dataset specifically curated to probe the moral dimensions of LLM outputs, addressing a wide range of ethical dilemmas and scenarios reflective of real-world complexities. The main contribution of this work lies in the development of benchmark datasets and metrics for assessing the moral identity of LLMs, which account for nuance, contextual sensitivity, and alignment with human ethical standards. Our methodology involves a multi-faceted approach, combining quantitative analysis with qualitative insights from ethics scholars to ensure a thorough evaluation of model performance. By applying our benchmark across several leading LLMs, we uncover significant variations in the moral reasoning capabilities of different models. These findings highlight the importance of considering moral reasoning in the development and evaluation of LLMs, as well as the need for ongoing research to address the biases and limitations uncovered in our study. We publicly release the benchmark at https://drive.google.com/drive/u/0/folders/1k93YZJserYc2CkqP8d4B3M3sgd3kA8W7 and open-source the code of the project at https://github.com/agiresearch/MoralBench

Is ETHICS about ethics? Evaluating the ETHICS benchmark

A Comparative Analysis on Ethical Benchmarking in Large Language Models

Making Intelligence: Ethical Values in IQ and ML Benchmarks

MoralBench: Moral Evaluation of LLMs

Measuring Ethics in AI with AI: A Methodology and Dataset Construction

Scruples: A Corpus of Community Ethical Judgments on 32,000 Real-Life Anecdotes

An Evaluation of GPT-4 on the ETHICS Dataset

Some Issues in Predictive Ethics Modeling: An Annotated Contrast Set of "Moral Stories"

Measuring ethical behavior with AI and natural language processing to assess business success

The ethical ambiguity of AI data enrichment: Measuring gaps in research ethics norms and practices

The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems

AI Ethics: A Bibliometric Analysis, Critical Issues, and Key Gaps

A Word on Machine Ethics: A Response to Jiang et al. (2021)

A Conceptual Framework for Ethical Evaluation of Machine Learning Systems

Values, Ethics, Morals? On the Use of Moral Concepts in NLP Research

Ethical debates amidst flawed healthcare artificial intelligence metrics

Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

Eagle: Ethical Dataset Given from Real Interactions

Informed AI Regulation: Comparing the Ethical Frameworks of Leading LLM Chatbots Using an Ethics-Based Audit to Assess Moral Reasoning and Normative Values

Artificial intelligence ethics by design. Evaluating public perception on the importance of ethical design principles of artificial intelligence