Abstract: Building a theoretical understanding of the capabilities of large language models (LLMs) is vital for our ability to predict and explain the behavior of these systems. Here, we investigate the structure of LLM capabilities by extracting latent capabilities from patterns of individual differences across a varied population of LLMs. Using a combination of Bayesian and frequentist factor analysis, we analyzed data from 29 different LLMs across 27 cognitive tasks. We found evidence that LLM capabilities are not monolithic. Instead, they are better explained by three well-delineated factors that represent reasoning, comprehension and core language modeling. Moreover, we found that these three factors can explain a high proportion of the variance in model performance. These results reveal a consistent structure in the capabilities of different LLMs and demonstrate the multifaceted nature of these capabilities. We also found that the three abilities show different relationships to model properties such as model size and instruction tuning. These patterns help refine our understanding of scaling laws and indicate that changes to a model that improve one ability might simultaneously impair others. Based on these findings, we suggest that benchmarks could be streamlined by focusing on tasks that tap into each broad model ability.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to reveal the ability structure of large language models (LLMs) in order to establish a theoretical understanding of these models' capabilities. Specifically, the authors attempt to solve the following key problems:

1. **Understanding the ability structure of LLMs**:
   - Are the abilities of LLMs a single, unified capacity, or can they be decomposed into multiple independent abilities?
   - How are these abilities organized and interrelated?
2. **Explaining and predicting the behavior of LLMs**:
   - How can understanding the ability structure of LLMs help explain and predict their performance on different tasks?
   - How do different model attributes (such as model size and instruction tuning) affect these abilities?
3. **Improving evaluation benchmarks**:
   - Although current evaluation benchmarks (such as BIG-bench and HELM) can test the performance of LLMs on multiple tasks, they offer limited interpretability and predictive power.
   - How can understanding the ability structure of LLMs make these benchmarks more efficient and robust?
4. **Exploring the relationship between abilities and model attributes**:
   - How do factors such as model size, instruction tuning, and the amount of training data affect different abilities of LLMs?
   - Are there trade-offs between certain abilities, that is, does enhancing one ability weaken others?

### Overview of methods

To answer these questions, the authors used a dataset from the HELM benchmark, which contains the performance of 29 different LLMs on 27 cognitive tasks. They applied Bayesian and frequentist factor analysis to extract latent abilities from patterns of individual differences across models, and then analyzed the relationship between these abilities and model attributes.
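
To make the idea of extracting latent abilities from a models-by-tasks score matrix concrete, here is a minimal sketch. It is not the authors' pipeline: the scores are synthetic (generated from three made-up latent abilities), the extraction uses a simple principal-component method on the task correlation matrix rather than the Bayesian and frequentist factor analysis described above, and the names `simulate_scores` and `extract_factors` are hypothetical.

```python
import numpy as np

def simulate_scores(n_models=29, n_tasks=27, n_factors=3, noise=0.3, seed=0):
    """Build a SYNTHETIC (models x tasks) score matrix from a few latent abilities."""
    rng = np.random.default_rng(seed)
    abilities = rng.normal(size=(n_models, n_factors))  # one row of abilities per model
    loadings = np.zeros((n_tasks, n_factors))
    for t in range(n_tasks):
        loadings[t, t % n_factors] = 1.0  # assumption: each task taps exactly one ability
    scores = abilities @ loadings.T + noise * rng.normal(size=(n_models, n_tasks))
    return scores, loadings

def extract_factors(scores, n_factors=3):
    """Principal-component extraction from the task-by-task correlation matrix."""
    corr = np.corrcoef(scores, rowvar=False)   # tasks x tasks correlations
    eigvals, eigvecs = np.linalg.eigh(corr)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale eigenvectors by sqrt(eigenvalue) to obtain unrotated loadings.
    est_loadings = eigvecs[:, :n_factors] * np.sqrt(eigvals[:n_factors])
    # Share of total variance captured by the retained factors.
    explained = eigvals[:n_factors].sum() / eigvals.sum()
    return est_loadings, explained

scores, true_loadings = simulate_scores()
est_loadings, explained = extract_factors(scores)
print(est_loadings.shape, round(float(explained), 3))
```

With real benchmark data, `scores` would be replaced by the observed results, and a rotation step would normally be applied before interpreting the loadings.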
### Main findings

- **Multidimensional ability structure**: The abilities of LLMs are not monolithic; they are better explained by three main factors: reasoning, comprehension, and core language modeling.
- **Relationship between abilities and model attributes**: Model size is positively correlated with all three abilities, but has the strongest correlation with comprehension. Instruction tuning has a positive impact on reasoning ability, but a negative impact on language modeling ability.
- **Task-dependent abilities**: Different types of tasks depend on different abilities. For example, mathematical reasoning and inductive reasoning tasks may depend on a single reasoning ability, while tasks involving comprehension (such as question answering and summary generation) depend on the same underlying comprehension ability.

Through these findings, the authors hope to provide a theoretical basis for future research and to improve the evaluation and understanding of LLMs.
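
The relationship between abilities and model attributes can be illustrated with a small sketch. It assumes per-model factor scores are already available and summarizes them in two simple ways: the Pearson correlation of each ability with log model size, and the mean difference between instruction-tuned and non-tuned models. The function name `relate_to_properties` and the toy numbers are hypothetical, not taken from the paper, and the paper's own analysis may differ.

```python
import numpy as np

def relate_to_properties(factor_scores, log_size, instruction_tuned):
    """Summarize how each ability relates to model size and instruction tuning."""
    factor_scores = np.asarray(factor_scores, dtype=float)  # models x abilities
    log_size = np.asarray(log_size, dtype=float)            # one value per model
    tuned = np.asarray(instruction_tuned, dtype=bool)       # one flag per model
    # Pearson correlation between log model size and each ability.
    size_corr = np.array([
        np.corrcoef(log_size, factor_scores[:, k])[0, 1]
        for k in range(factor_scores.shape[1])
    ])
    # Mean ability of tuned models minus mean ability of non-tuned models.
    tuning_gap = factor_scores[tuned].mean(axis=0) - factor_scores[~tuned].mean(axis=0)
    return size_corr, tuning_gap

# Toy, made-up inputs: four models, two abilities.
toy_scores = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]]
toy_log_size = [1.0, 2.0, 3.0, 4.0]
toy_tuned = [False, False, True, True]
size_corr, tuning_gap = relate_to_properties(toy_scores, toy_log_size, toy_tuned)
print(size_corr, tuning_gap)
```

A positive gap for one ability alongside a negative gap for another is the kind of pattern that would indicate a trade-off of the sort described above.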

Revealing the structure of language model capabilities
