KOR-Bench: Benchmarking Language Models on Knowledge-Orthogonal Reasoning Tasks

Kaijing Ma, Xinrun Du, Yunran Wang, Haoran Zhang, Zhoufutu Wen, Xingwei Qu, Jian Yang, Jiaheng Liu, Minghao Liu, Xiang Yue, Wenhao Huang, Ge Zhang
2024-10-18
Abstract: In this paper, we introduce Knowledge-Orthogonal Reasoning (KOR), which minimizes the impact of domain-specific knowledge for a more accurate evaluation of models' reasoning abilities in out-of-distribution scenarios. Based on this concept, we propose the Knowledge-Orthogonal Reasoning Benchmark (KOR-Bench), encompassing five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual. KOR-Bench emphasizes the effectiveness of models in applying new rule descriptions to solve novel rule-driven questions. O1-Preview and O1-Mini achieve accuracies of 72.88% and 70.16%, significantly outperforming Claude-3.5-Sonnet and GPT-4o, which score 58.96% and 58.00%, revealing considerable performance gaps and highlighting KOR-Bench's effectiveness. We conduct thorough analyses to identify bottlenecks in the Cipher task using Stepwise Prompting, discovering that two rounds of Self-Correction yield optimal results. Complex Task Processing evaluates model performance across three integrated tasks, while we also explore the impact of Tricks on the Puzzle task and visualize rule-focused attention to enhance our understanding of model behavior. KOR-Bench aims to enhance reasoning evaluation and support further research in this field.
What problem does this paper attempt to address?
The paper addresses a limitation of existing benchmarks for measuring the reasoning ability of large language models (LLMs). Existing benchmarks often reward a model's memory of domain-specific knowledge rather than its ability to understand and follow new rules and solve problems with them. As a result, they may fail to distinguish genuine reasoning from recall of learned patterns.

To address this, the paper proposes the concept of "Knowledge-Orthogonal Reasoning" (KOR) and builds a new evaluation benchmark on it, the "Knowledge-Orthogonal Reasoning Benchmark" (KOR-Bench). KOR-Bench aims to improve on existing evaluation methods in three ways:

1. **Reduce the influence of domain-specific knowledge**: The rules in KOR-Bench are designed to be independent of the domain-specific knowledge the model was exposed to during pre-training, so that they do not appear in the pre-training data.
2. **Test the model's internal reasoning and planning abilities**: By introducing new elements and rules, KOR-Bench evaluates how the model applies newly defined rules to solve new rule-driven problems, rather than relying on data retrieval or memorized information.
3. **Cover multiple reasoning tasks**: KOR-Bench contains five task categories: Operation, Logic, Cipher, Puzzle, and Counterfactual reasoning. Each category is built on new symbols, concepts, execution rules, problem-solving frameworks, or story backgrounds.

With this design, KOR-Bench aims to evaluate a model's reasoning on unseen rules and frameworks more comprehensively and fairly, and to offer new directions for future research.
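To make the idea of a rule-driven question concrete, here is a minimal Python sketch of an Operation-style item scored by exact match. Everything in it is an assumption for illustration only: the `※` operator, its definition, the `RuleItem` structure, and both toy solvers are hypothetical and are not taken from KOR-Bench or its evaluation code.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RuleItem:
    rule: str      # natural-language rule shown alongside the question
    question: str  # question that can only be answered by applying the rule
    answer: str    # reference answer obtained by applying the rule


def star_op(a: int, b: int) -> int:
    """Hypothetical newly defined operation: a ※ b = 2*a + b^2."""
    return 2 * a + b * b


def build_items() -> List[RuleItem]:
    # Illustrative items; not drawn from the actual benchmark.
    rule = "Define a ※ b = 2*a + b^2."
    pairs = [(3, 4), (5, 2), (0, 7)]
    return [RuleItem(rule, f"Compute {a} ※ {b}.", str(star_op(a, b))) for a, b in pairs]


def parse_operands(question: str) -> List[int]:
    # "Compute 3 ※ 4." -> [3, 4]
    body = question[len("Compute "):-1]
    return [int(part) for part in body.split(" ※ ")]


def rule_following_solver(rule: str, question: str) -> str:
    # Stand-in for a model that applies the rule as stated.
    a, b = parse_operands(question)
    return str(2 * a + b * b)


def prior_knowledge_solver(rule: str, question: str) -> str:
    # Stand-in for a model that ignores the rule and falls back on a
    # familiar operation (here, ordinary multiplication).
    a, b = parse_operands(question)
    return str(a * b)


def accuracy(items: List[RuleItem], solver: Callable[[str, str], str]) -> float:
    # Exact-match accuracy over all items.
    correct = sum(solver(it.rule, it.question).strip() == it.answer for it in items)
    return correct / len(items)


if __name__ == "__main__":
    items = build_items()
    print("rule-following:", accuracy(items, rule_following_solver))
    print("prior-knowledge:", accuracy(items, prior_knowledge_solver))
```

In this toy setup the solver that applies the stated rule answers every item correctly, while the one that substitutes a familiar operation answers none, which is the kind of separation between rule application and recall that the summary describes.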