Inducing Human-like Biases in Moral Reasoning Language Models

Artem Karpov, Seong Hah Cho, Austin Meek, Raymond Koopmanschap, Lucy Farnik, Bogdan-Ionut Cirstea

2024-11-23

Abstract: In this work, we study the alignment (BrainScore) of large language models (LLMs) fine-tuned for moral reasoning on behavioral data and/or brain data of humans performing the same task. We also explore if fine-tuning several LLMs on the fMRI data of humans performing moral reasoning can improve the BrainScore. We fine-tune several LLMs (BERT, RoBERTa, DeBERTa) on moral reasoning behavioral data from the ETHICS benchmark [Hendrycks et al., 2020], on the moral reasoning fMRI data from Koster-Hale et al. [2013], or on both. We study both the accuracy on the ETHICS benchmark and the BrainScores between model activations and fMRI data. While larger models generally performed better on both metrics, BrainScores did not significantly improve after fine-tuning.

Artificial Intelligence, Computers and Society, Machine Learning

What problem does this paper attempt to address?

The paper investigates whether large language models (LLMs) fine-tuned for moral reasoning can be brought into closer alignment with the activity patterns of the human brain during the same task. Specifically, the authors study two aspects:

1. **Brain-Model Alignment Metric (BrainScore)**: The authors measure the BrainScore of models fine-tuned on human behavioral data and/or brain data (fMRI) for moral-reasoning tasks. BrainScore is a measure of the similarity between a model's internal representations and human brain activity.
2. **Evaluation of Fine-Tuning Effects**: The authors test whether fine-tuning several LLMs (BERT, RoBERTa, DeBERTa) on fMRI data from humans performing moral reasoning improves their BrainScore. They also evaluate the accuracy of the fine-tuned models on the ETHICS benchmark.

The core objective is to explore whether human neural data can be used to enhance the moral-reasoning ability of AI models, and whether such enhancement can be achieved by increasing the alignment between models and human brain activity. The experimental results show that although larger models generally perform better in both accuracy and BrainScore, fine-tuning does not significantly improve BrainScore. This finding suggests that more data and more effective fine-tuning methods may be required to improve brain-model alignment in specific domains.
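The text above does not say how BrainScore is computed. As a minimal sketch, assuming the common linear-predictivity approach (fit a cross-validated ridge regression from model activations to voxel responses, then average the per-voxel Pearson correlation on held-out stimuli), the idea can be illustrated as follows. The function names `ridge_fit` and `brain_score`, the regularization strength, and the fold count are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def brain_score(activations, voxels, alpha=1.0, n_folds=5, seed=0):
    """Illustrative brain-model alignment score (assumed definition).

    activations: (n_stimuli, n_features) model activations per stimulus
    voxels:      (n_stimuli, n_voxels) fMRI responses to the same stimuli
    Returns the mean held-out Pearson correlation across voxels and folds.
    """
    rng = np.random.default_rng(seed)
    n = activations.shape[0]
    folds = np.array_split(rng.permutation(n), n_folds)
    fold_scores = []
    for test_idx in folds:
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        X_train, Y_train = activations[train_mask], voxels[train_mask]
        X_test, Y_test = activations[test_idx], voxels[test_idx]

        # Center with training-set statistics only, to avoid test leakage.
        x_mean, y_mean = X_train.mean(axis=0), Y_train.mean(axis=0)
        W = ridge_fit(X_train - x_mean, Y_train - y_mean, alpha)
        pred = (X_test - x_mean) @ W + y_mean

        # Pearson correlation between prediction and response, per voxel.
        pred_c = pred - pred.mean(axis=0)
        true_c = Y_test - Y_test.mean(axis=0)
        denom = np.sqrt((pred_c ** 2).sum(axis=0) * (true_c ** 2).sum(axis=0))
        r = (pred_c * true_c).sum(axis=0) / np.where(denom == 0, 1.0, denom)
        fold_scores.append(r.mean())
    return float(np.mean(fold_scores))
```

Under this assumed definition, comparing the score of a model before and after fine-tuning on the same stimuli and voxels would indicate whether fine-tuning changed brain-model alignment.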

Fine-Tuning Language Models for Ethical Ambiguity: A Comparative Study of Alignment with Human Responses

Moral Alignment for LLM Agents

MoCa: Measuring Human-Language Model Alignment on Causal and Moral Judgment Tasks

Exploring the psychology of LLMs' Moral and Legal Reasoning

Language Model Alignment in Multilingual Trolley Problems

The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making

MoralBERT: A Fine-Tuned Language Model for Capturing Moral Values in Social Discussions

Beyond Labels: Aligning Large Language Models with Human-like Reasoning

Exploring and steering the moral compass of Large Language Models

Moral Foundations of Large Language Models

The Moral Mind(s) of Large Language Models

Decoding Multilingual Moral Preferences: Unveiling LLM's Biases Through the Moral Machine Experiment

The moral machine experiment on large language models

Large-scale moral machine experiment on large language models

When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment

Instruction-tuning Aligns LLMs to the Human Brain

MoralBench: Moral Evaluation of LLMs

Skin-in-the-Game: Decision Making via Multi-Stakeholder Alignment in LLMs

Evaluating Moral Beliefs across LLMs through a Pluralistic Framework