Revealing the structure of language model capabilities

Ryan Burnell, Han Hao, Andrew R. A. Conway, Jose Hernandez Orallo
2023-06-14
Abstract: Building a theoretical understanding of the capabilities of large language models (LLMs) is vital for our ability to predict and explain the behavior of these systems. Here, we investigate the structure of LLM capabilities by extracting latent capabilities from patterns of individual differences across a varied population of LLMs. Using a combination of Bayesian and frequentist factor analysis, we analyzed data from 29 different LLMs across 27 cognitive tasks. We found evidence that LLM capabilities are not monolithic. Instead, they are better explained by three well-delineated factors that represent reasoning, comprehension and core language modeling. Moreover, we found that these three factors can explain a high proportion of the variance in model performance. These results reveal a consistent structure in the capabilities of different LLMs and demonstrate the multifaceted nature of these capabilities. We also found that the three abilities show different relationships to model properties such as model size and instruction tuning. These patterns help refine our understanding of scaling laws and indicate that changes to a model that improve one ability might simultaneously impair others. Based on these findings, we suggest that benchmarks could be streamlined by focusing on tasks that tap into each broad model ability.
Computation and Language, Artificial Intelligence, Machine Learning
### What problems does this paper attempt to solve?

This paper aims to reveal the structure of large language model (LLM) capabilities in order to build a theoretical understanding of what these models can do. Specifically, the authors address the following key problems:

1. **Understanding the ability structure of LLMs**:
   - Are the abilities of LLMs a single, unified capability, or can they be decomposed into several distinct abilities?
   - How are these abilities organized and interrelated?
2. **Explaining and predicting the behavior of LLMs**:
   - How can understanding the ability structure of LLMs help explain and predict model performance across different tasks?
   - How do model attributes (such as model size and instruction tuning) affect these abilities?
3. **Improving evaluation benchmarks**:
   - Current evaluation benchmarks (such as BIG-bench and HELM) test LLM performance on many tasks, but they offer limited interpretability and predictive power.
   - How can knowledge of the ability structure make these benchmarks more efficient and robust?
4. **Exploring the relationship between abilities and model attributes**:
   - How do factors such as model size, instruction tuning, and the amount of training data affect different abilities of LLMs?
   - Are there trade-offs between abilities, that is, does enhancing one ability weaken others?

### Overview of methods

To answer these questions, the authors used data from the HELM benchmark covering the performance of 29 different LLMs on 27 cognitive tasks. They applied Bayesian and frequentist factor analysis to extract latent abilities from individual differences between models, then analyzed how these abilities relate to model attributes.
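
To make the idea of extracting latent abilities from a models-by-tasks score table concrete, here is a minimal sketch. It is not the authors' pipeline: it swaps in plain principal-component extraction from the task correlation matrix (via power iteration) as a simple stand-in for Bayesian or frequentist factor analysis, and it runs on synthetic scores generated from three made-up latent abilities rather than on HELM data. All function and variable names are hypothetical.

```python
import math
import random


def correlation_matrix(scores):
    """Pearson correlations between the task columns of a models x tasks table."""
    n_tasks = len(scores[0])
    cols = [[row[j] for row in scores] for j in range(n_tasks)]
    devs = []
    for col in cols:
        mean = sum(col) / len(col)
        devs.append([x - mean for x in col])
    norms = [math.sqrt(sum(d * d for d in dev)) for dev in devs]
    return [
        [sum(a * b for a, b in zip(devs[i], devs[j])) / (norms[i] * norms[j])
         for j in range(n_tasks)]
        for i in range(n_tasks)
    ]


def top_components(corr, k, iters=500, seed=0):
    """Leading k (eigenvalue, eigenvector) pairs via power iteration with deflation."""
    n = len(corr)
    mat = [row[:] for row in corr]
    rng = random.Random(seed)
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(n)]
        for _ in range(iters):
            w = [sum(mat[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue estimate for the converged vector.
        eigval = sum(v[i] * mat[i][j] * v[j] for i in range(n) for j in range(n))
        pairs.append((eigval, v))
        # Deflate: remove the component just found before looking for the next one.
        for i in range(n):
            for j in range(n):
                mat[i][j] -= eigval * v[i] * v[j]
    return pairs


# Synthetic stand-in data (an assumption, NOT HELM results): 29 "models",
# 3 latent abilities, 3 noisy tasks per ability.
N_MODELS, N_ABILITIES, TASKS_PER_ABILITY = 29, 3, 3
data_rng = random.Random(42)
scores = []
for _ in range(N_MODELS):
    abilities = [data_rng.gauss(0, 1) for _ in range(N_ABILITIES)]
    scores.append([a + data_rng.gauss(0, 0.3)
                   for a in abilities for _ in range(TASKS_PER_ABILITY)])

corr = correlation_matrix(scores)
pairs = top_components(corr, k=3)
# For a correlation matrix the eigenvalues sum to the number of tasks,
# so this ratio is the share of total variance captured by three components.
explained = sum(eigval for eigval, _ in pairs) / len(corr)
print(f"variance explained by 3 components: {explained:.2f}")
```

A real analysis would typically also rotate the loadings and compare models with different numbers of factors before interpreting them as abilities; this sketch only shows how shared variance across tasks can be summarized by a few components.
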
### Main findings

- **Multidimensional ability structure**: The abilities of LLMs are not monolithic; they are better explained by three main factors: reasoning, comprehension, and core language modeling.
- **Relationship between abilities and model attributes**: Model size is positively correlated with all three abilities, but has the strongest correlation with the comprehension ability. Instruction tuning has a positive impact on reasoning ability, but a negative impact on language modeling ability.
- **Task-dependent abilities**: Different types of tasks depend on different abilities. For example, mathematical reasoning and inductive reasoning tasks may depend on a single reasoning ability, while comprehension tasks (such as question answering and summarization) depend on the same underlying comprehension ability.

Through these findings, the authors hope to provide a theoretical basis for future research and to improve the evaluation and understanding of LLMs.
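
As an illustration of how a relationship between an ability and a model attribute could be quantified, the sketch below computes a Pearson correlation between a per-model factor score and log model size, and between the same score and a 0/1 instruction-tuning flag (Pearson on a binary variable is the point-biserial correlation). The numbers are made-up placeholders chosen only to exercise the code; they are not taken from the paper or from HELM, and the variable names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Placeholder inputs (assumptions for illustration only, not real results).
params_billions = [1.0, 7.0, 13.0, 70.0, 175.0]
reasoning_score = [-1.2, -0.3, 0.1, 0.6, 0.8]   # hypothetical factor scores
instruction_tuned = [0, 1, 0, 1, 1]             # 1 = instruction-tuned

# Model sizes span orders of magnitude, so correlate on a log scale.
log_size = [math.log10(p) for p in params_billions]

r_size = pearson(log_size, reasoning_score)
r_tuned = pearson(instruction_tuned, reasoning_score)
print(f"r(log size, reasoning) = {r_size:.2f}")
print(f"r(instruction tuned, reasoning) = {r_tuned:.2f}")
```

With only a few dozen models, as in the study, such correlations carry wide uncertainty, so they are best read alongside confidence or credible intervals.
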