LLMs for Relational Reasoning: How Far are We?

Zhiming Li, Yushi Cao, Xiufeng Xu, Junzhe Jiang, Xu Liu, Yon Shin Teo, Shang-wei Lin, Yang Liu
2024-01-17
Abstract: Large language models (LLMs) have revolutionized many areas (e.g. natural language processing, software engineering, etc.) by achieving state-of-the-art performance on extensive downstream tasks. Aiming to achieve robust and general artificial intelligence, there has been a surge of interest in investigating the reasoning ability of the LLMs. Whereas the textual and numerical reasoning benchmarks adopted by previous works are rather shallow and simple, it is hard to conclude that the LLMs possess strong reasoning ability by merely achieving positive results on these benchmarks. Recent efforts have demonstrated that the LLMs are poor at solving sequential decision-making problems that require common-sense planning by evaluating their performance on the reinforcement learning benchmarks. In this work, we conduct an in-depth assessment of several state-of-the-art LLMs' reasoning ability based on the inductive logic programming (ILP) benchmark, which is broadly recognized as a representative and challenging measurement for evaluating logic program induction/synthesis systems as it requires inducing strict cause-effect logic to achieve robust deduction on independent and identically distributed (IID) and out-of-distribution (OOD) test samples. Our evaluations illustrate that compared with the neural program induction systems which are much smaller in model size, the state-of-the-art LLMs are much poorer in terms of reasoning ability by achieving much lower performance and generalization using either natural language prompting or truth-value matrix prompting.
Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
This paper evaluates how well large language models (LLMs) perform relational reasoning, and in particular how they compare with neural program induction (NPI) models on tasks that require logical reasoning. The authors note that although LLMs have made remarkable progress in natural language processing, software engineering, and other fields, and score well on many benchmarks, those benchmarks are often relatively simple and cannot comprehensively measure reasoning ability. They therefore use inductive logic programming (ILP) benchmarks to assess the relational reasoning of several state-of-the-art LLMs in depth. The paper focuses on three research questions:
1. How well do LLMs perform relational reasoning under standard natural language prompts?
2. How well do LLMs perform relational reasoning under truth-value matrix prompts?
3. Can the latest prompting techniques effectively improve the relational reasoning of LLMs?
To answer these questions, the authors designed a general evaluation pipeline consisting of a sample generator, a modality compiler, and an evaluation module, and used it to evaluate LLMs and NPI models on two benchmark tasks: family tree reasoning and general graph reasoning. The results show that although LLMs perform well on some relatively simple tasks, they are significantly inferior to NPI models on tasks requiring complex logical reasoning. This indicates that current LLMs still have much room for improvement in relational reasoning.
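To make the three pipeline stages more concrete, here is a minimal Python sketch of how a family tree sample might be generated, compiled into the two prompt modalities, and scored. Everything in it is an illustrative assumption rather than the paper's actual implementation: the toy family, the function names, the exact prompt wording, the choice of the grandparent relation as the target, and the cell-wise accuracy metric are all hypothetical.

```python
from itertools import product

# Hypothetical toy sample: each pair (a, b) means "a is a parent of b".
PEOPLE = ["Ann", "Bob", "Cid", "Dee"]
PARENT_EDGES = [("Ann", "Bob"), ("Bob", "Cid"), ("Bob", "Dee")]


def build_matrix(people, edges):
    """Sample generator (assumed form): encode a binary relation as a boolean matrix."""
    index = {name: k for k, name in enumerate(people)}
    matrix = [[False] * len(people) for _ in people]
    for a, b in edges:
        matrix[index[a]][index[b]] = True
    return matrix


def to_natural_language(edges):
    """Modality compiler, natural language variant (assumed wording)."""
    return " ".join(f"{a} is a parent of {b}." for a, b in edges)


def to_truth_value_matrix(matrix):
    """Modality compiler, truth-value matrix variant: one row of 0/1 per person."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in matrix)


def grandparent(parent):
    """Ground truth for an assumed target relation:
    grandparent(i, j) holds if some k has parent(i, k) and parent(k, j)."""
    n = len(parent)
    return [
        [any(parent[i][k] and parent[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def accuracy(predicted, expected):
    """Evaluation module (assumed metric): fraction of matrix cells that match."""
    n = len(expected)
    cells = list(product(range(n), repeat=2))
    hits = sum(predicted[i][j] == expected[i][j] for i, j in cells)
    return hits / len(cells)


parent = build_matrix(PEOPLE, PARENT_EDGES)
target = grandparent(parent)
print(to_natural_language(PARENT_EDGES))
print(to_truth_value_matrix(parent))
```

In a setup like this, a model's answer would be parsed back into a boolean matrix and passed to `accuracy` together with `target`. Out-of-distribution testing could then be approximated by generating families larger than those shown in the prompt examples, though how the paper constructs its IID and OOD splits is not described in this summary.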