Inducing Human-like Biases in Moral Reasoning Language Models

Artem Karpov, Seong Hah Cho, Austin Meek, Raymond Koopmanschap, Lucy Farnik, Bogdan-Ionut Cirstea
2024-11-23
Abstract: In this work, we study the alignment (BrainScore) of large language models (LLMs) fine-tuned for moral reasoning on behavioral data and/or brain data of humans performing the same task. We also explore if fine-tuning several LLMs on the fMRI data of humans performing moral reasoning can improve the BrainScore. We fine-tune several LLMs (BERT, RoBERTa, DeBERTa) on moral reasoning behavioral data from the ETHICS benchmark [Hendrycks et al., 2020], on the moral reasoning fMRI data from Koster-Hale et al. [2013], or on both. We study both the accuracy on the ETHICS benchmark and the BrainScores between model activations and fMRI data. While larger models generally performed better on both metrics, BrainScores did not significantly improve after fine-tuning.
Artificial Intelligence, Computers and Society, Machine Learning
What problem does this paper attempt to address?
The paper asks whether large language models (LLMs) fine-tuned for moral reasoning become better aligned with human brain activity recorded during the same task, and whether fine-tuning on human neural data can strengthen that alignment. Specifically, the authors study two aspects:

1. **Brain-Model Alignment Metric (BrainScore)**: Measure, and try to improve, the brain-model alignment of LLMs on moral reasoning tasks by fine-tuning them on human behavioral data and/or brain data (fMRI). BrainScore is a metric of the similarity between a model's internal representations and human brain activity.
2. **Evaluation of Fine-Tuning Effects**: Test whether fine-tuning several LLMs (BERT, RoBERTa, DeBERTa) on fMRI data from humans performing moral reasoning improves their BrainScore, while also evaluating the accuracy of the fine-tuned models on the ETHICS benchmark.

The core objective is to explore whether human neural data can be used to make the moral reasoning of AI models more human-like, and whether this shows up as closer alignment between model activations and brain activity. The experimental results show that larger models generally perform better on both accuracy and BrainScore, but fine-tuning does not significantly improve BrainScore. This finding suggests that more data and more effective fine-tuning methods may be required to improve brain-model alignment in specific domains.
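To make the idea of a brain-model alignment score concrete, here is a minimal sketch of one common way such a score is computed: fit a linear (ridge) map from model activations to fMRI responses, then report the mean Pearson correlation between predicted and measured responses on held-out stimuli. This summary does not state the paper's exact procedure, so the ridge penalty, the 5-fold split, the function names (`ridge_fit`, `pearson_per_column`, `brain_score`), and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


def pearson_per_column(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    denom = np.where(denom == 0, np.nan, denom)  # constant columns give NaN
    return (A * B).sum(axis=0) / denom


def brain_score(activations, fmri, alpha=1.0, n_folds=5, seed=0):
    """Hypothetical alignment score (not the paper's confirmed method).

    activations: (n_stimuli, n_features) model activations per stimulus
    fmri:        (n_stimuli, n_voxels) brain responses to the same stimuli
    Returns the cross-validated mean correlation between predicted and
    measured voxel responses.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(activations.shape[0])
    fold_scores = []
    for test_idx in np.array_split(order, n_folds):
        train_idx = np.setdiff1d(order, test_idx)
        # Centre with training-set statistics only, to avoid leakage.
        x_mean = activations[train_idx].mean(axis=0)
        y_mean = fmri[train_idx].mean(axis=0)
        W = ridge_fit(activations[train_idx] - x_mean,
                      fmri[train_idx] - y_mean, alpha)
        pred = (activations[test_idx] - x_mean) @ W + y_mean
        fold_scores.append(np.nanmean(pearson_per_column(pred, fmri[test_idx])))
    return float(np.mean(fold_scores))


# Synthetic demonstration (random data, not real activations or fMRI).
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 16))
true_map = rng.normal(size=(16, 8))
fmri_related = acts @ true_map + 0.1 * rng.normal(size=(200, 8))
fmri_noise = rng.normal(size=(200, 8))

score_related = brain_score(acts, fmri_related)
score_noise = brain_score(acts, fmri_noise)
```

In this toy setup, responses that are a noisy linear function of the activations should score close to 1, while unrelated noise should score near 0. With real data, the activations would typically come from a chosen model layer and the fMRI matrix from a region of interest, and both choices can change the score considerably.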