Abstract: The capabilities and limitations of Large Language Models have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies. The sheer volume of data used in the design of LLMs has precluded us from applying the method traditionally used to measure generalisation: train-test set separation. To overcome this, we study what kind of generalisation strategies LLMs employ when performing reasoning tasks by investigating the pretraining data they rely on. For two models of different sizes (7B and 35B) and 2.5B of their pretraining tokens, we identify what documents influence the model outputs for three simple mathematical reasoning tasks and contrast this to the data that are influential for answering factual questions. We find that, while the models rely on mostly distinct sets of data for each factual question, a document often has a similar influence across different reasoning questions within the same task, indicating the presence of procedural knowledge. We further find that the answers to factual questions often show up in the most influential data. However, for reasoning questions the answers usually do not show up as highly influential, nor do the answers to the intermediate reasoning steps. When we characterise the top ranked documents for the reasoning questions qualitatively, we confirm that the influential documents often contain procedural knowledge, like demonstrating how to obtain a solution using formulae or code. Our findings indicate that the approach to reasoning the models use is unlike retrieval, and more like a generalisable strategy that synthesises procedural knowledge from documents doing a similar form of reasoning.

What problem does this paper attempt to address?

This paper asks how large language models (LLMs) acquire reasoning abilities from pretraining data. Specifically, it examines the generalization strategies LLMs adopt when performing reasoning tasks, and how those strategies relate to the procedural knowledge contained in pretraining data. By identifying the data that two models of different scales (7B and 35B parameters) rely on for simple mathematical reasoning tasks, the paper investigates how the models learn and apply procedural knowledge, and whether this differs from simply retrieving answers.

### Main problems of the paper

1. **How LLMs learn reasoning from pretraining data**:
   - The authors explore how the models learn and apply procedural knowledge by analyzing the pretraining data the models rely on when completing reasoning tasks.
   - Specifically, the paper studies which pretraining documents influence the models' outputs on three mathematical reasoning tasks (two-step arithmetic, calculating slopes, solving linear equations).
2. **Whether the strategies used by the models in reasoning tasks differ from data retrieval**:
   - The authors compare the types and amounts of data the models rely on for factual questions and for reasoning questions, to determine whether the models adopt more generalized strategies when reasoning rather than simply retrieving answers from pretraining data.
3. **The role of procedural knowledge in reasoning**:
   - The authors find that, within the same mathematical task, different reasoning questions often rely on similar pretraining data, indicating that the models use procedural knowledge when reasoning.
   - This procedural knowledge usually appears in the pretraining data as code or mathematical formulas, and the models complete reasoning tasks by learning from it.
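To make the three reasoning tasks concrete, here is a minimal Python sketch of the kind of procedure each one calls for. The function names and the exact problem forms (for example `(a + b) * c` for two-step arithmetic and `a*x + b = c` for linear equations) are illustrative assumptions, not taken from the paper.

```python
from fractions import Fraction


def two_step_arithmetic(a, b, c):
    """Assumed form (a + b) * c: one intermediate step, then the final answer."""
    intermediate = a + b
    return intermediate * c


def slope(p1, p2):
    """Slope of the line through two points, via (y2 - y1) / (x2 - x1)."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("vertical line: slope is undefined")
    return Fraction(y2 - y1, x2 - x1)


def solve_linear(a, b, c):
    """Solve the assumed form a*x + b = c for x."""
    if a == 0:
        raise ValueError("a must be non-zero")
    return Fraction(c - b, a)
```

Each function encodes a reusable procedure rather than a stored answer, which is the distinction the summary draws between procedural knowledge and retrieval.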
### Main findings

1. **Procedural knowledge drives reasoning**:
   - Within the same mathematical task, the influence of pretraining documents is significantly positively correlated across different reasoning questions, indicating that the models rely on procedural knowledge when reasoning.
   - For example, in the slope task, the documents the model relies on often contain code or mathematical formulas for calculating slopes.
2. **The models rely less on any single document when reasoning**:
   - Compared with factual questions, the influence of any single document is lower in reasoning tasks, and the set of documents the model relies on is broader and more abstract.
   - This indicates that the models adopt a more generalized strategy when reasoning rather than simply retrieving specific answers.
3. **The answers to factual questions are more common in the most influential pretraining data**:
   - For factual questions, the answers often appear in the top 0.01% of pretraining documents ranked by influence, while the answers to reasoning questions rarely appear in these high-influence documents.
4. **Code plays an important role in mathematical reasoning**:
   - Code data is significantly over-represented among the high-influence documents for reasoning tasks, especially mathematical reasoning tasks.

### Conclusion

The paper concludes that, in reasoning tasks, large language models rely not only on specific answers but also on procedural knowledge extracted from pretraining data. This procedural knowledge lets the models solve new reasoning problems by applying similar procedures rather than simply retrieving answers from pretraining data. The finding matters for future pretraining data selection: high-quality data that demonstrates procedures may be more effective than trying to cover every specific case.
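The correlation finding can be illustrated with a small sketch: if each question yields one influence score per pretraining document, then questions that draw on the same procedural documents should have positively correlated score vectors. The code below assumes Pearson correlation as the measure (the summary does not say which coefficient the paper uses), and the score vectors are made-up toy values, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical influence scores over the same five documents (toy values).
slope_q1 = [0.9, 0.7, 0.1, 0.0, 0.2]   # first slope question
slope_q2 = [0.8, 0.6, 0.2, 0.1, 0.1]   # second slope question, same task
factual_q = [0.0, 0.1, 0.0, 0.9, 0.0]  # a factual question

same_task = pearson(slope_q1, slope_q2)    # strongly positive in this toy data
cross_type = pearson(slope_q1, factual_q)  # negative in this toy data
```

In this toy setup the two slope questions share their most influential documents, while the factual question depends mostly on one different document, mirroring the pattern the summary describes.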

Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models

Concise and Organized Perception Facilitates Large Language Models for Deductive Reasoning.

Reasoning with Large Language Models, a Survey

Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge

Can Large Language Models put 2 and 2 together? Probing for Entailed Arithmetical Relationships

Towards Reasoning in Large Language Models: A Survey

Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences

LogicPro: Improving Complex Logical Reasoning via Program-Guided Learning

CLR-Fact: Evaluating the Complex Logical Reasoning Capability of Large Language Models over Factual Knowledge

Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought

Exploring the Role of Reasoning Structures for Constructing Proofs in Multi-Step Natural Language Reasoning with Large Language Models

Large Language Models Are Cross-Lingual Knowledge-Free Reasoners

On Exploring the Reasoning Capability of Large Language Models with Knowledge Graphs

Reasoning Factual Knowledge in Structured Data with Large Language Models

Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation

On Memorization of Large Language Models in Logical Reasoning

Which Programming Language and What Features at Pre-training Stage Affect Downstream Logical Inference Performance?

Teaching Smaller Language Models To Generalise To Unseen Compositional Questions (Full Thesis)

Are Large Language Models Really Good Logical Reasoners? A Comprehensive Evaluation and Beyond

Meaningful Learning: Advancing Abstract Reasoning in Large Language Models Via Generic Fact Guidance.