Abstract: Reasoning is a central capability of human intelligence. In recent years, with the advent of large-scale datasets, pretrained large language models have emerged with new capabilities, including reasoning. However, these models still struggle with long-term, complex reasoning tasks, such as playing chess. Based on the observation that expert chess players employ a dual approach combining long-term strategic play with short-term tactical play along with language explanation, we propose improving the reasoning capability of large language models in chess by integrating annotated strategy and tactics. Specifically, we collect a dataset named MATE, which consists of 1 million chess positions with candidate moves annotated by chess experts for strategy and tactics. We finetune the LLaMA-3-8B model and compare it against state-of-the-art commercial language models in the task of selecting better chess moves. Our experiments show that our models perform better than GPT, Claude, and Gemini models. We find that language explanations can enhance the reasoning capability of large language models.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper studies how to improve the reasoning ability of large language models (LLMs) on complex tasks, using chess as a testbed. The authors observe that high-level chess players combine long-term strategy with short-term tactics and support both with language explanations. Based on this observation, the paper proposes enhancing the reasoning ability of LLMs in chess by integrating annotated strategic and tactical information.

### Main Contributions

1. **Collection of a High-Quality Dataset**: The authors constructed a dataset named MATE, containing approximately 1 million chess positions. For each position, the candidate moves are annotated with strategic and tactical information by experienced chess players, including world champion-level experts.
2. **Improvement of Reasoning Ability through Language Explanations**: The study found that language explanations can significantly improve the reasoning ability of large language models.
3. **Integration of Strategy and Tactics (Dual Mode)**: Integrating long-term strategy and short-term tactics further improves the performance of language models in chess.

### Experimental Setup

- **Base Model**: The pre-trained LLaMA-3-8B model is used as the base model.
- **Fine-Tuning**: Fine-tuning is performed using llamafactory, with a cosine learning rate scheduler, a maximum learning rate of 5×10^-6, and 5 training epochs.
- **Evaluation Method**: Model performance is evaluated on different subsets of the MATE dataset, in both zero-shot and few-shot settings.

### Experimental Results

- **Complexity of the MATE Dataset**: The experimental results show that the MATE dataset is sufficiently complex to distinguish the performance of different commercial LLMs.
- **Role of Language Explanations**: Most tested LLMs performed better when given language explanations, and performed best on the MATE-Strategy&Tactic subset.
- **Integration of Strategy and Tactics**: Most models performed better on the MATE-Strategy&Tactic subset than on the other subsets. For example, gpt-4o's performance in the zero-shot setting was 10%, 14%, and 30% higher than on MATE-T, MATE-S, and MATE-N, respectively.

### Conclusion

The authors propose a method to enhance the reasoning ability of large language models in chess by integrating strategic and tactical annotations. The experimental results indicate that language explanations help improve the models' reasoning ability, and that the dual-mode approach of combining long-term strategy with short-term tactics has significant advantages for model performance. Future research can apply this method to other tasks to further enhance the reasoning ability of language models.
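The experimental setup above mentions a cosine learning rate scheduler with a maximum learning rate of 5×10^-6 over 5 epochs. As a rough illustration of what such a schedule looks like, here is a minimal sketch in plain Python. It is not the authors' configuration or the llamafactory implementation: the absence of warmup, the floor of zero, and the `steps_per_epoch` value are all assumptions made for the example, and `cosine_lr` is a hypothetical helper name.

```python
import math


def cosine_lr(step: int, total_steps: int,
              max_lr: float = 5e-6, min_lr: float = 0.0) -> float:
    """Cosine decay from max_lr (step 0) down to min_lr (final step).

    Assumptions for this sketch: no warmup phase and a floor of min_lr.
    The summary only states the scheduler type and the peak rate.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that out-of-range steps do not leave the [0, 1] interval.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    epochs = 5                # from the summary
    steps_per_epoch = 1000    # hypothetical; depends on dataset and batch size
    total = epochs * steps_per_epoch
    for epoch in range(epochs + 1):
        step = epoch * steps_per_epoch
        print(f"epoch {epoch}: lr = {cosine_lr(step, total):.3e}")
```

The rate starts at the peak, falls slowly at first, drops fastest around the midpoint of training, and flattens out near the end, which is the usual motivation for choosing a cosine schedule over a linear one.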

Explore the Reasoning Capability of LLMs in the Chess Testbed

GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents

Case Study: Testing Model Capabilities in Some Reasoning Tasks

GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations

Large Language Models on the Chessboard: A Study on ChatGPT's Formal Language Comprehension and Complex Reasoning Skills

Assessing Logical Puzzle Solving in Large Language Models: Insights from a Minesweeper Case Study

TMGBench: A Systematic Game Benchmark for Evaluating Strategic Reasoning Abilities of LLMs

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

On Memorization of Large Language Models in Logical Reasoning

GLoRE: Evaluating Logical Reasoning of Large Language Models

LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models

LLMs for Relational Reasoning: How Far are We?

LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models

P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Can LLMs Reason in the Wild with Programs?

LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models

LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models

Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks

Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games

Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models' Reasoning Performance