Abstract: Eliciting reasoning capabilities from language models (LMs) is a critical direction on the path towards building intelligent systems. Most recent studies dedicated to reasoning focus on out-of-distribution performance on procedurally-generated synthetic benchmarks, bespoke-built to evaluate specific skills only. This trend makes results hard to transfer across publications, slowing down progress. Three years ago, a similar issue was identified and rectified in the field of neural algorithmic reasoning, with the advent of the CLRS benchmark. CLRS is a dataset generator comprising graph execution traces of classical algorithms from the Introduction to Algorithms textbook. Inspired by this, we propose CLRS-Text -- a textual version of these algorithmic traces. Out of the box, CLRS-Text is capable of procedurally generating trace data for thirty diverse, challenging algorithmic tasks across any desirable input distribution, while offering a standard pipeline in which any additional algorithmic tasks may be created in the benchmark. We fine-tune and evaluate various LMs as generalist executors on this benchmark, validating prior work and revealing a novel, interesting challenge for the LM reasoning community. Our code is available at https://github.com/google-deepmind/clrs/tree/master/clrs/_src/clrs_text.

What problem does this paper attempt to address?

The main problem this paper addresses is: **how to evaluate and improve the capabilities of language models (LMs) on algorithmic reasoning tasks**. Specifically, the authors focus on how language models perform on tasks that require reasoning, especially multi-step logical reasoning, mathematical operations, and the execution of complex algorithms.

### Background and Motivation

1. **Existing Problems**:
   - Language models perform well in many scenarios but poorly on tasks that require reasoning. For example, they show clear deficiencies in handling inverse concepts, basic arithmetic, geometric problems, and recognizing higher-order languages.
   - Existing reasoning evaluations rely mainly on static datasets, which are prone to overfitting and do not reflect a model's true generalization ability.
   - Different studies construct their own benchmark datasets, which makes results hard to compare across studies and slows progress.
2. **Inspiration from the CLRS Benchmark**:
   - Three years ago, a similar problem in neural algorithmic reasoning was addressed by the CLRS benchmark. CLRS is a dataset generator that produces execution traces of classical algorithms, covering 30 algorithms from "Introduction to Algorithms".
   - The success of CLRS inspired the authors to propose CLRS-Text, a textual version of CLRS, for evaluating the reasoning ability of language models.

### Goals of CLRS-Text

CLRS-Text aims to:

- **Provide a unified benchmark** for evaluating the performance of language models across multiple algorithmic reasoning tasks.
- **Generate diverse task instances** covering 30 challenging algorithmic tasks under different input distributions.
- **Support zero-shot and few-shot evaluations** to test how well a model generalizes to unseen data.
- **Simplify cross-study comparisons** so that different studies can use the same benchmark for fair comparison.

### Methods

1. **Dataset Construction**:
   - CLRS-Text converts the graph representation used in CLRS into a text representation suitable for language models.
   - The input, output, and intermediate states (trace) of each algorithmic task are converted into a text format, and the model predicts the final result from this information.
2. **Training and Evaluation**:
   - Fine-tune the Gemma 2B model and evaluate it on different problem sizes.
   - Evaluate in zero-shot and few-shot settings to test the generalization ability of the model.
   - Resample test data points to avoid the "reasoning gap" introduced by static datasets.

### Results and Discussion

- **Experimental results** show that randomized positional embeddings (RPE) can improve the generalization ability of the model, especially length generalization.
- **Compared with general-purpose models** (such as Gemini 1.5 Flash), the fine-tuned Gemma 2B performs better on CLRS-Text tasks, showing the advantage of dedicated models on specific tasks.
- **Future work directions** include further optimizing the model architecture and exploring more effective reasoning mechanisms, especially to address the limitations of autoregressive language models on multi-step reasoning tasks.

In conclusion, CLRS-Text provides a unified benchmark for evaluating and improving the algorithmic reasoning ability of language models, which should help advance research in this field.
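
The dataset construction and resampling steps described under Methods can be illustrated with a small Python sketch. It is not the CLRS-Text implementation: the function names, the prompt layout, and the choice of insertion sort as the example task are all assumptions made for illustration. It only shows the general idea of turning an algorithm's input, intermediate states, and output into text, and of drawing fresh test instances from a seeded sampler.

```python
import random


def insertion_sort_trace(values):
    """Return the list state after each outer-loop step of insertion sort."""
    arr = list(values)
    states = []
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        # Shift larger elements right to make room for the key.
        while j >= 0 and arr[j] > key:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key
        states.append(list(arr))
    return states


def format_example(values):
    """Serialize one task instance as a (prompt, target) pair of strings.

    The text layout here is hypothetical, not the benchmark's real format.
    """
    states = insertion_sort_trace(values)
    prompt = "insertion_sort:\nkey: " + str(list(values)) + "\ntrace | sorted:\n"
    trace = ", ".join(str(state) for state in states)
    target = trace + " | " + str(sorted(values))
    return prompt, target


def sample_example(rng, length, low=0, high=9):
    """Draw a fresh random instance; a new seed gives new test data."""
    values = [rng.randint(low, high) for _ in range(length)]
    return format_example(values)


if __name__ == "__main__":
    prompt, target = sample_example(random.Random(0), length=5)
    print(prompt + target)
```

Passing a different seed to `random.Random` on each evaluation run draws a new test set, and increasing `length` beyond the training sizes gives a simple way to probe length generalization.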

The CLRS-Text Algorithmic Reasoning Language Benchmark

The CLRS Algorithmic Reasoning Benchmark

CLUTRR: A Diagnostic Benchmark for Inductive Reasoning from Text

SALSA-CLRS: A Sparse and Scalable Benchmark for Algorithmic Reasoning

CLadder: Assessing Causal Reasoning in Language Models

CRQBench: A Benchmark of Code Reasoning Questions

DR.BENCH: Diagnostic Reasoning Benchmark for Clinical Natural Language Processing

Benchmarking Large Language Models for Math Reasoning Tasks

StrucText-Eval: Evaluating Large Language Model's Reasoning Ability in Structure-Rich Text

SLR: A million-scale comprehensive crossword dataset for simultaneous learning and reasoning

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

Learning to Reason for Text Generation from Scientific Tables

MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning

NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks

CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation

The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models

PLUGH: A Benchmark for Spatial Understanding and Reasoning in Large Language Models

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

LLM Reasoners: New Evaluation, Library, and Analysis of Step-by-Step Reasoning with Large Language Models

Linguini: A benchmark for language-agnostic linguistic reasoning