KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

Kaijing Ma, Xinrun Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, Wenhao Huang, Ge Zhang

2024-10-18

Abstract: In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), which minimizes the impact of domain-specific knowledge for a more accurate evaluation of models' reasoning abilities in out-of-distribution scenarios. Based on this concept, we propose the Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench), encompassing five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. KOR-Bench emphasizes the effectiveness of models in applying new rule descriptions to solve novel rule-driven questions. O1-Preview and O1-Mini achieve accuracies of 72.88% and 70.16%, significantly outperforming Claude-3.5-Sonnet and GPT-4o, which score 58.96% and 58.00%, revealing considerable performance gaps and highlighting KOR-Bench's effectiveness. We conduct thorough analyses to identify bottlenecks in the Cipher task using Stepwise Prompting, discovering that two rounds of Self-Correction yield optimal results. Complex Task Processing evaluates model performance across three integrated tasks, while we also explore the impact of Tricks on the Puzzle task and visualize rule-focused attention to enhance our understanding of model behavior. KOR-Bench aims to enhance reasoning evaluation and support further research in this field.

Databases

What problem does this paper attempt to address?

The paper addresses a limitation of existing benchmarks for measuring the reasoning ability of large language models (LLMs). Existing benchmarks often rely on a model's memory of domain-specific knowledge rather than testing its ability to understand new rules, follow them, and solve problems with them. As a result, they may fail to distinguish genuine reasoning from simple recall of learned patterns. To address this, the paper proposes the concept of "Knowledge-Orthogonal Reasoning" (KOR) and builds a new evaluation benchmark on it, the "Knowledge-Orthogonal Reasoning Benchmark" (KOR-Bench). KOR-Bench aims to improve on existing evaluation methods in three ways:

1. **Reduce the influence of domain-specific knowledge**: The rules in KOR-Bench are independent of the domain-specific knowledge the model was exposed to during pre-training, so that these rules do not appear in the pre-training data.
2. **Test the model's internal reasoning and planning abilities**: By introducing new elements and rules, KOR-Bench evaluates how the model applies newly defined rules to solve new rule-driven problems, rather than relying on data retrieval or memorized information.
3. **Cover multiple reasoning tasks**: KOR-Bench contains five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual reasoning. Each category is built on new symbols, concepts, execution rules, problem-solving frameworks, or story backgrounds.

Through this design, KOR-Bench can evaluate more comprehensively and fairly how a model reasons over unseen rules and frameworks, offering new directions and insights for future research.
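To make the idea of a rule-driven question concrete, here is a minimal Python sketch of how an Operation-style item could be posed and scored. Everything in it is an assumption made for illustration: the operator `a # b = 3*a - b`, the `RuleTask` structure, the exact-match scoring, and the two stand-in "models" are hypothetical and are not taken from KOR-Bench or its evaluation code.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RuleTask:
    rule: str      # description of a newly defined rule (hypothetical)
    question: str  # question that requires applying that rule
    answer: str    # reference answer derived from the rule


def custom_op(a: int, b: int) -> int:
    """Made-up operator for illustration only: a # b = 3*a - b."""
    return 3 * a - b


def build_tasks() -> List[RuleTask]:
    rule = "Define a # b = 3*a - b."
    pairs = [(2, 1), (4, 5), (0, 7)]
    return [
        RuleTask(rule, f"Compute {a} # {b}.", str(custom_op(a, b)))
        for a, b in pairs
    ]


def accuracy(model: Callable[[str], str], tasks: List[RuleTask]) -> float:
    """Exact-match accuracy of a model over the tasks (assumed metric)."""
    correct = sum(
        model(f"{t.rule}\n{t.question}").strip() == t.answer for t in tasks
    )
    return correct / len(tasks)


def _operands(prompt: str) -> List[int]:
    # Read the two operands from the question, which is the last prompt line.
    return [int(x) for x in re.findall(r"\d+", prompt.splitlines()[-1])]


def rule_following_model(prompt: str) -> str:
    # Stand-in for a model that applies the newly defined rule.
    a, b = _operands(prompt)
    return str(custom_op(a, b))


def prior_knowledge_model(prompt: str) -> str:
    # Stand-in for a model that ignores the rule and falls back on
    # a familiar operation (ordinary addition).
    a, b = _operands(prompt)
    return str(a + b)


if __name__ == "__main__":
    tasks = build_tasks()
    print("rule-following:", accuracy(rule_following_model, tasks))
    print("prior-knowledge:", accuracy(prior_knowledge_model, tasks))
```

In this toy setup the stand-in that applies the stated rule answers every item correctly, while the one that falls back on familiar addition answers none, which is the kind of gap a knowledge-orthogonal benchmark is meant to expose.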
