Zhijing Jin, Max Kleiman-Weiner, Giorgio Piatti, Sydney Levine, Jiarui Liu, Fernando Gonzalez, Francesco Ortu, András Strausz, Mrinmaya Sachan, Rada Mihalcea, Yejin Choi, Bernhard Schölkopf
Abstract: We evaluate the moral alignment of large language models (LLMs) with human preferences in multilingual trolley problems. Building on the Moral Machine experiment, which captures over 40 million human judgments across 200+ countries, we develop a cross-lingual corpus of moral dilemma vignettes in over 100 languages called MultiTP. This dataset enables the assessment of LLMs' decision-making processes in diverse linguistic contexts. Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions: species, gender, fitness, status, age, and the number of lives involved. By correlating these preferences with the demographic distribution of language speakers and examining the consistency of LLM responses to various prompt paraphrasings, our findings provide insights into cross-lingual and ethical biases of LLMs and their intersection. We discover significant variance in alignment across languages, challenging the assumption of uniform moral reasoning in AI systems and highlighting the importance of incorporating diverse perspectives in AI ethics. The results underscore the need for further research on the integration of multilingual dimensions in responsible AI research to ensure fair and equitable AI interactions worldwide. Our code and data are at https://github.com/causalNLP/moralmachine
### Problems the Paper Attempts to Solve
This paper evaluates the moral alignment of large language models (LLMs) in multilingual trolley problems, that is, whether the models' decisions match human preferences. Building on the Moral Machine experiment, the researchers constructed a cross-lingual corpus of moral dilemmas, called MultiTP, that covers over 100 languages, and used it to assess LLM decision-making in different linguistic contexts.
### Main Research Objectives
1. **Evaluate the overall alignment of LLMs with human preferences**:
- The researchers quantified overall alignment with a global misalignment score. This score is a weighted average of the per-language misalignment scores, with each language weighted by its number of speakers.
2. **Analyze LLMs' performance across six major moral dimensions**:
- The six dimensions are species, gender, fitness (health status), social status, age, and the number of lives involved. The researchers decomposed the overall misalignment score to see how each model's preferences differ from human preferences along each dimension.
3. **Explore how LLM responses differ across languages**:
- The researchers analyzed whether LLMs' responses differ significantly from one language to another and identified groups of languages in which LLMs behave similarly.
4. **Test the "language inequality" hypothesis**:
- The researchers explored whether LLMs align more closely with human preferences in high-resource languages (such as English) than in low-resource languages (such as some African languages).
5. **Evaluate the consistency of LLMs' responses to different formulations of the same trolley problem prompt**:
- The researchers conducted robustness studies in which the same trolley problem was presented under several prompt paraphrasings, to check whether a model's responses stay stable.
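
The speaker-weighted global misalignment score from objective 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the language codes, and all numbers are hypothetical, and the only thing taken from the summary is the idea of a weighted average with speaker counts as weights.

```python
def global_misalignment(per_language_mis, speakers):
    """Speaker-weighted average of per-language misalignment scores.

    per_language_mis: dict mapping a language code to its misalignment score.
    speakers: dict mapping a language code to its number of speakers.
    Both dicts and their contents are hypothetical inputs for this sketch.
    """
    total = sum(speakers[lang] for lang in per_language_mis)
    if total <= 0:
        raise ValueError("total speaker count must be positive")
    weighted = sum(per_language_mis[lang] * speakers[lang] for lang in per_language_mis)
    return weighted / total


# Made-up values for illustration only (speaker counts in millions).
example_mis = {"en": 0.5, "de": 0.7, "sw": 0.9}
example_speakers = {"en": 300, "de": 100, "sw": 100}
score = global_misalignment(example_mis, example_speakers)
print(f"global misalignment: {score:.2f}")
```

With these made-up inputs, the most widely spoken language dominates the average, so the global score sits closer to its misalignment than to the unweighted mean.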
### Research Methods
- **Dataset**: The MultiTP dataset contains 97,520 trolley problem scenarios, each translated into 107 different languages.
- **Model Selection**: The researchers selected 19 different LLMs, including both open-weight and closed-weight models.
- **Preference Evaluation**: By systematically varying the six moral dimensions, the researchers extracted each model's preferences for each dimension.
- **Misalignment Measurement**: The researchers introduced an overall preference vector $p = (p_{\text{species}}, p_{\text{gender}}, p_{\text{fitness}}, p_{\text{status}}, p_{\text{age}}, p_{\text{number}})$ and defined the misalignment (MIS) score as the L2 distance between the human preference vector $p_h$ and the model preference vector $p_m$, i.e. $\mathrm{MIS}(p_h, p_m) = \lVert p_h - p_m \rVert_2$.
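
The misalignment measurement can be sketched in a few lines. This is a minimal example under stated assumptions: the dictionary layout, the function name, and the preference values are hypothetical, and the scale of the preference values is not specified in this summary, so they are treated as plain numbers. Only the six dimension names and the L2 distance come from the description above.

```python
import math

# The six moral dimensions named in the summary.
DIMENSIONS = ("species", "gender", "fitness", "status", "age", "number")


def misalignment(human, model):
    """L2 (Euclidean) distance between two six-dimensional preference vectors.

    human, model: dicts mapping each dimension name to a preference value.
    """
    return math.sqrt(sum((human[d] - model[d]) ** 2 for d in DIMENSIONS))


# Made-up preference values for illustration only.
human_pref = {d: 0.5 for d in DIMENSIONS}
model_pref = dict(human_pref, gender=0.8, age=0.9)  # differs on two dimensions
mis = misalignment(human_pref, model_pref)
print(f"MIS = {mis:.2f}")
```

Because the score is a distance, it is zero only when the model matches the human preference on every dimension, and a large gap on a single dimension can dominate the total.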
### Main Findings
1. **Overall Consistency**:
- Only a few models (such as Llama 3.1 70B, Llama 3 70B, and Llama 3 8B) had misalignment scores below 0.6, meaning they were more closely aligned with human preferences than the other models evaluated.
2. **Performance Across Moral Dimensions**:
- Misalignment in the gender, age, and fitness (health status) dimensions correlated strongly with overall misalignment. The correlation coefficient was 0.87 for gender, 0.69 for age, and 0.68 for fitness.
3. **Response Differences Across Languages**:
- There were significant differences in LLMs' responses across different languages, but no clear "language inequality" phenomenon was found, meaning LLMs performed similarly in high-resource and low-resource languages.
4. **Robustness Study**:
- LLMs gave consistent responses to different formulations of the same trolley problem prompt, which suggests that their preferences are reasonably stable under paraphrasing.
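
To make the correlation figures in finding 2 concrete, the sketch below computes a correlation coefficient between per-dimension and overall misalignment across models. It assumes a Pearson correlation, which this summary does not confirm, and the per-model scores are invented for illustration; they are not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores, one entry per model (not data from the paper).
gender_mis = [0.10, 0.20, 0.30, 0.40]
overall_mis = [0.55, 0.65, 0.80, 0.90]
r = pearson(gender_mis, overall_mis)
print(f"correlation between gender and overall misalignment: {r:.2f}")
```

A coefficient near 1 would mean that models with larger misalignment on that dimension also tend to have larger overall misalignment.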
### Conclusion
This paper systematically evaluated the moral alignment of LLMs in multilingual trolley problems by constructing the MultiTP dataset. The study found that although most LLMs were significantly misaligned with human preferences, their alignment was relatively balanced across languages, with no clear "language inequality" phenomenon. The work highlights the need to evaluate the moral alignment of LLMs in a multilingual context and provides a reference point for future research.