Explore the Reasoning Capability of LLMs in the Chess Testbed

Shu Wang, Lei Ji, Renxi Wang, Wenxiao Zhao, Haokun Liu, Yifan Hou, Ying Nian Wu
2024-11-11
Abstract: Reasoning is a central capability of human intelligence. In recent years, with the advent of large-scale datasets, pretrained large language models have emerged with new capabilities, including reasoning. However, these models still struggle with long-term, complex reasoning tasks, such as playing chess. Based on the observation that expert chess players employ a dual approach combining long-term strategic play with short-term tactical play along with language explanation, we propose improving the reasoning capability of large language models in chess by integrating annotated strategy and tactic. Specifically, we collect a dataset named MATE, which consists of 1 million chess positions with candidate moves annotated by chess experts for strategy and tactics. We finetune the LLaMA-3-8B model and compare it against state-of-the-art commercial language models in the task of selecting better chess moves. Our experiments show that our models perform better than GPT, Claude, and Gemini models. We find that language explanations can enhance the reasoning capability of large language models.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper studies how to improve the reasoning ability of large language models (LLMs) on complex tasks, using chess as a testbed. The authors observe that high-level chess players combine long-term strategy with short-term tactics and supplement both with language explanations. Based on this observation, the paper proposes enhancing the reasoning ability of large language models in chess by integrating annotated strategic and tactical information.

### Main Contributions

1. **Collection of High-Quality Dataset**: The authors constructed a dataset named MATE, containing approximately 1 million chess positions. For each position, the candidate moves are annotated with strategic and tactical information by experienced chess players, including world champion-level experts.
2. **Improvement of Reasoning Ability through Language Explanations**: The study found that language explanations can significantly improve the reasoning ability of large language models.
3. **Integration of Strategy and Tactics Dual Mode**: By integrating long-term strategy and short-term tactics, the performance of language models in chess can be further improved.

### Experimental Setup

- **Base Model**: The pre-trained LLaMA-3-8B model is used as the base model.
- **Fine-Tuning**: Fine-tuning is performed using llamafactory, with a cosine learning rate scheduler, a maximum learning rate of 5×10^-6, and training for 5 epochs.
- **Evaluation Method**: Model performance is evaluated on different subsets of the MATE dataset, in both zero-shot and few-shot settings.

### Experimental Results

- **Complexity of the MATE Dataset**: The experimental results show that the MATE dataset is sufficiently complex to distinguish the performance of different commercial LLMs.
- **Role of Language Explanations**: Most tested LLMs performed better when given language explanations, with the best results on the MATE-Strategy&Tactic subset.
- **Integration of Strategy and Tactics**: Most models performed better on the MATE-Strategy&Tactic subset than on the other subsets. For example, gpt-4o's performance in the zero-shot setting improved by 10%, 14%, and 30% compared to MATE-T, MATE-S, and MATE-N, respectively.

### Conclusion

The authors propose a method to enhance the reasoning ability of large language models in chess by integrating strategic and tactical annotations. The experimental results indicate that language explanations help improve the reasoning ability of the models, and that combining long-term strategy with short-term tactics yields a clear advantage in model performance. Future research can apply this method to other tasks to further enhance the reasoning ability of language models.
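### Illustration: Cosine Learning Rate Schedule

The fine-tuning configuration summarized under Experimental Setup (cosine scheduler, maximum learning rate 5×10^-6, 5 epochs) can be sketched numerically. The snippet below is a minimal illustration of generic cosine annealing, not the authors' training code or llamafactory's implementation. It assumes no warmup and a minimum learning rate of 0, neither of which is stated in the summary, and `STEPS_PER_EPOCH` is a hypothetical value chosen only to make the example concrete.

```python
import math

PEAK_LR = 5e-6  # maximum learning rate stated in the summary
EPOCHS = 5      # number of training epochs stated in the summary

# Hypothetical: the summary gives neither the batch size nor the steps per epoch.
STEPS_PER_EPOCH = 1000
TOTAL_STEPS = EPOCHS * STEPS_PER_EPOCH


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = PEAK_LR, min_lr: float = 0.0) -> float:
    """Return the cosine-annealed learning rate at `step`.

    Assumes no warmup phase and decay from `peak_lr` down to `min_lr`
    over `total_steps` optimizer steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so out-of-range values map to the schedule endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the learning rate at the start of each epoch and at the end of training.
    for epoch in range(EPOCHS + 1):
        step = epoch * STEPS_PER_EPOCH
        print(f"epoch {epoch}: lr = {cosine_lr(step, TOTAL_STEPS):.3e}")
```

Under these assumptions the rate starts at the stated maximum, passes through half of it midway through training, and decays to zero at the final step. A real run would typically add a warmup phase, which shifts the early part of the curve.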