Do Moral Judgment and Reasoning Capability of LLMs Change with Language? A Study using the Multilingual Defining Issues Test

Aditi Khandelwal, Utkarsh Agarwal, Kumar Tanmay, Monojit Choudhury
2024-02-03
Abstract: This paper explores the moral judgment and moral reasoning abilities exhibited by Large Language Models (LLMs) across languages through the Defining Issues Test. It is a well-known fact that moral judgment depends on the language in which the question is asked. We extend the work of Tanmay et al. (2023) beyond English to 5 new languages (Chinese, Hindi, Russian, Spanish and Swahili), and probe three LLMs -- ChatGPT, GPT-4 and Llama2Chat-70B -- that show substantial multilingual text processing and generation abilities. Our study shows that the moral reasoning ability of all models, as indicated by the post-conventional score, is substantially inferior for Hindi and Swahili compared to Spanish, Russian, Chinese and English, while there is no clear trend in performance among the latter four languages. The moral judgments, too, vary considerably by language.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper primarily explores the moral judgment and moral reasoning abilities of large language models (LLMs) in different languages, using the multi-language version of the Defining Issues Test (DIT) for the study.

### Research Background and Objectives

- **Research Background**: Previous studies have shown that humans exhibit different moral judgments when faced with moral dilemmas expressed in different languages. This study is based on the work of Tanmay et al. in 2023, which used the DIT to assess the moral reasoning abilities of LLMs in an English environment.
- **Research Objectives**: To extend the 2023 study by Tanmay et al., exploring whether the moral judgment and reasoning abilities of LLMs change in multilingual environments and the reasons behind these changes. Specifically, the study focuses on how language influences the moral judgment and reasoning abilities of LLMs.

### Main Findings

- **Research Method**: The authors selected five languages (Spanish, Russian, Chinese, Hindi, and Swahili) and conducted experiments on three LLMs (ChatGPT, GPT-4, and Llama2Chat-70B). The experiments used five classic moral dilemma cases from the DIT and four new cases proposed by Tanmay et al.
- **Experimental Design**: By translating these cases into the selected languages and presenting these moral dilemmas and their related ethical considerations to the LLMs, the study evaluated the moral judgment and reasoning abilities of the LLMs. The experimental results included the models' choices of solutions to the moral dilemmas and the four most important moral considerations.
- **Key Observations**:
  - GPT-4 demonstrated the best multilingual moral reasoning ability, showing relatively consistent moral judgments and scores across all languages.
  - For ChatGPT and Llama2Chat-70B, performance varied significantly between languages, especially in Hindi, where it was close to a random baseline.
  - Among all models, the best performance was observed in English and Spanish, followed by Russian, Chinese, Swahili, and Hindi.
  - Although GPT-4 had high moral scores in both English and Russian, there were significant differences in moral judgments between the two languages.

### Conclusion

This paper systematically studies, for the first time, the ability of LLMs to handle moral dilemmas in different language environments through the multi-language version of the DIT. The research reveals the impact of different languages on the moral judgment and reasoning abilities of LLMs and identifies performance differences of LLMs across languages. Additionally, the study has created a multilingual version of the moral dilemma case set, which will aid future research in related fields.
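
The summary above refers to a post-conventional score derived from the four considerations a model ranks as most important. As a rough illustration only, the sketch below follows the classic DIT convention of weighting the four ranked items 4, 3, 2 and 1 and counting the points that fall on post-conventional items. The function name, the item identifiers, the mapping of items to the post-conventional stage, and the normalization to a 0-100 scale are all assumptions made for this example; the paper's exact scoring may differ.

```python
from typing import Sequence, Set

# Classic DIT-style rank weights: the item ranked most important earns 4 points,
# the next 3, then 2, then 1. (Assumed here; not taken from the paper itself.)
RANK_WEIGHTS = (4, 3, 2, 1)


def p_score(
    top_four_per_dilemma: Sequence[Sequence[int]],
    post_conventional_items: Sequence[Set[int]],
) -> float:
    """Return a post-conventional score on a 0-100 scale.

    top_four_per_dilemma[d] holds the four item ids the model ranked as most
    important for dilemma d, in order of importance.
    post_conventional_items[d] holds the (hypothetical) ids of the items in
    dilemma d that express post-conventional reasoning.
    """
    if len(top_four_per_dilemma) != len(post_conventional_items):
        raise ValueError("need one post-conventional item set per dilemma")
    if not top_four_per_dilemma:
        raise ValueError("need at least one dilemma")

    earned = 0
    for ranking, pc_items in zip(top_four_per_dilemma, post_conventional_items):
        if len(ranking) != len(RANK_WEIGHTS):
            raise ValueError("each dilemma needs exactly four ranked items")
        # Add the rank weight of every ranked item that is post-conventional.
        earned += sum(w for w, item in zip(RANK_WEIGHTS, ranking) if item in pc_items)

    # Normalize by the maximum attainable points (10 per dilemma).
    max_points = sum(RANK_WEIGHTS) * len(top_four_per_dilemma)
    return 100.0 * earned / max_points


if __name__ == "__main__":
    # Two made-up dilemmas with made-up item ids.
    rankings = [[5, 2, 9, 1], [3, 4, 7, 8]]
    pc_sets = [{5, 9}, {8}]
    print(p_score(rankings, pc_sets))
```

Under these assumptions, a model whose top-four choices are all post-conventional scores 100, and one that never ranks such an item scores 0. Comparing the score per language against the score of a uniformly random ranking gives one way to read the "close to a random baseline" observation.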