Abstract: In the rapidly evolving field of artificial intelligence, large language models (LLMs) have emerged as powerful tools for a wide range of applications, from natural language processing to decision-making support systems. However, as these models become increasingly integrated into societal frameworks, ensuring that they operate within ethical and moral boundaries has become critical. This paper introduces a novel benchmark designed to measure and compare the moral reasoning capabilities of LLMs. We present the first comprehensive dataset specifically curated to probe the moral dimensions of LLM outputs, covering a wide range of ethical dilemmas and scenarios reflective of real-world complexities. The main contribution of this work is the development of benchmark datasets and metrics for assessing the moral identity of LLMs, which account for nuance, contextual sensitivity, and alignment with human ethical standards. Our methodology combines quantitative analysis with qualitative insights from ethics scholars to ensure a thorough evaluation of model performance. Applying our benchmark to several leading LLMs, we uncover significant variations in the moral reasoning capabilities of different models. These findings highlight the importance of considering moral reasoning in the development and evaluation of LLMs, as well as the need for ongoing research to address the biases and limitations uncovered in our study. We publicly release the benchmark at <a class="link-external link-https" href="https://drive.google.com/drive/u/0/folders/1k93YZJserYc2CkqP8d4B3M3sgd3kA8W7" rel="external noopener nofollow">this https URL</a> and open-source the project code at <a class="link-external link-https" href="https://github.com/agiresearch/MoralBench" rel="external noopener nofollow">this https URL</a>.

AI Benchmarks and Datasets for LLM Evaluation

AIBench: an Industry Standard AI Benchmark Suite from Internet Services

AIBench: an Industry Standard AI Benchmark Suite

AIBench: Towards Scalable and Comprehensive Datacenter AI Benchmarking

COMPL-AI Framework: A Technical Interpretation and LLM Benchmarking Suite for the EU Artificial Intelligence Act

AI Competitions and Benchmarks: Dataset Development

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices

Dynamic Intelligence Assessment: Benchmarking LLMs on the Road to AGI with a Focus on Model Confidence

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

SAIBench: Benchmarking AI for Science

CEBench: A Benchmarking Toolkit for the Cost-Effectiveness of LLM Pipelines

AIBench Training: Balanced Industry-Standard AI Training Benchmarking

Towards Assuring EU AI Act Compliance and Adversarial Robustness of LLMs

Introducing Milabench: Benchmarking Accelerators for AI

LawBench: Benchmarking Legal Knowledge of Large Language Models

LAB-Bench: Measuring Capabilities of Language Models for Biology Research

AAAR-1.0: Assessing AI's Potential to Assist Research

MoralBench: Moral Evaluation of LLMs

CERN for AI: a theoretical framework for autonomous simulation-based artificial intelligence testing and alignment