Evaluating Large Language Models with NeuBAROCO: Syllogistic Reasoning Ability and Human-like Biases

Risako Ando, Takanobu Morishita, Hirohiko Abe, Koji Mineshima, Mitsuhiro Okada
2023-06-22
Abstract: This paper investigates whether current large language models exhibit biases in logical reasoning, similar to humans. Specifically, we focus on syllogistic reasoning, a well-studied form of inference in the cognitive science of human deduction. To facilitate our analysis, we introduce a dataset called NeuBAROCO, originally designed for psychological experiments that assess human logical abilities in syllogistic reasoning. The dataset consists of syllogistic inferences in both English and Japanese. We examine three types of biases observed in human syllogistic reasoning: belief biases, conversion errors, and atmosphere effects. Our findings demonstrate that current large language models struggle more with problems involving these three types of biases.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper asks whether current large language models exhibit biases in logical reasoning similar to those of humans. The research focuses on syllogistic reasoning, a form of inference that has been widely studied in the cognitive science of human deduction. For the analysis, the authors introduce a dataset named NeuBAROCO, which was originally designed for psychological experiments that evaluate human logical abilities in syllogistic reasoning. The dataset contains syllogistic inferences in English and Japanese. The study examines three types of bias observed in human syllogistic reasoning: belief bias, conversion error, and the atmosphere effect. It finds that current large language models have more difficulty with problems involving these three biases.
The main contributions of the paper are:
1. Proposing the NeuBAROCO dataset, designed specifically for syllogistic reasoning, as a resource for evaluating human-like biases in language models.
2. Using this dataset to evaluate the logical reasoning abilities of several recent large language models in English and Japanese.
3. Showing that current large language models have significant deficiencies on problems where these three biases are likely to lead to an incorrect answer.
Through these studies, the authors aim to better understand how large language models perform in logical reasoning and how their reasoning differs from that of humans.
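The three biases are named above but not illustrated. As they are commonly described in the psychology of reasoning, a conversion error treats "All A are B" as if it also meant "All B are A"; the atmosphere effect is a tendency to accept a conclusion whose form matches the premises (for example, a "some" conclusion after "some" premises); and belief bias is a tendency to judge a conclusion by how believable it is rather than by whether it follows. The Python sketch below is not taken from the paper or from NeuBAROCO. It is a minimal brute-force validity checker, written under two assumptions that are ours: statements are encoded as hypothetical `(quantifier, subject, predicate)` tuples, and "All A are B" is read without existential import (it counts as true when nothing is A). It shows why the first two patterns are logically invalid even though they look plausible.

```python
from itertools import product

# Hypothetical encoding (not from the paper): three abstract terms and
# four statement forms: "all", "no", "some", "some_not".
TERMS = ("A", "B", "C")

# A "kind" of individual is fixed by which of the three terms apply to it.
KINDS = list(product([False, True], repeat=len(TERMS)))


def holds(statement, model):
    """Return True if the statement is true in the model (a list of kinds)."""
    quantifier, x, y = statement
    i, j = TERMS.index(x), TERMS.index(y)
    if quantifier == "all":       # assumption: no existential import
        return all(kind[j] for kind in model if kind[i])
    if quantifier == "no":
        return not any(kind[i] and kind[j] for kind in model)
    if quantifier == "some":
        return any(kind[i] and kind[j] for kind in model)
    if quantifier == "some_not":
        return any(kind[i] and not kind[j] for kind in model)
    raise ValueError(f"unknown quantifier: {quantifier}")


def is_valid(premises, conclusion):
    """Check validity by searching every non-empty model for a counterexample."""
    for mask in product([False, True], repeat=len(KINDS)):
        model = [kind for kind, present in zip(KINDS, mask) if present]
        if not model:
            continue  # skip the empty domain
        if all(holds(p, model) for p in premises) and not holds(conclusion, model):
            return False  # premises true, conclusion false
    return True


# A valid form: All A are B, All B are C, therefore All A are C.
print(is_valid([("all", "A", "B"), ("all", "B", "C")], ("all", "A", "C")))  # True

# Conversion error: "All A are B" does not entail "All B are A".
print(is_valid([("all", "A", "B")], ("all", "B", "A")))  # False

# Atmosphere effect: two "some" premises make a "some" conclusion tempting,
# but it does not follow.
print(is_valid([("some", "A", "B"), ("some", "B", "C")], ("some", "A", "C")))  # False
```

The checker looks only at logical form, so it is blind to belief bias by construction: substituting real-world nouns for A, B, and C changes how believable a conclusion sounds but not whether it is valid.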