LLM Circuit Analyses Are Consistent Across Training and Scale

Curt Tigges, Michael Hanna, Qinan Yu, Stella Biderman
2024-07-15
Abstract: Most currently deployed large language models (LLMs) undergo continuous training or additional finetuning. By contrast, most research into LLMs' internal mechanisms focuses on models at one snapshot in time (the end of pre-training), raising the question of whether their results generalize to real-world settings. Existing studies of mechanisms over time focus on encoder-only or toy models, which differ significantly from most deployed models. In this study, we track how model mechanisms, operationalized as circuits, emerge and evolve across 300 billion tokens of training in decoder-only LLMs, in models ranging from 70 million to 2.8 billion parameters. We find that task abilities and the functional components that support them emerge consistently at similar token counts across scale. Moreover, although such components may be implemented by different attention heads over time, the overarching algorithm that they implement remains. Surprisingly, both these algorithms and the types of components involved therein can replicate across model scale. These results suggest that circuit analyses conducted on small models at the end of pre-training can provide insights that still apply after additional pre-training and over model scale.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper asks whether the internal mechanisms of large language models (LLMs) stay consistent as training continues, a question motivated by the fact that deployed models often undergo continued training or additional fine-tuning. Specifically, it tests whether results obtained through "circuit" analysis are consistent and generalizable across model scales and training stages.

The key contributions of the paper can be summarized as follows:

1. **Consistency of Task Ability and Functional Components**: Regardless of model size, task ability and the functional components that support it (such as name-mover heads, copy-suppression heads, and successor heads) emerge at similar numbers of training tokens. This suggests that results from smaller models can guide the analysis of larger ones.
2. **Stability of Circuit Algorithms**: Although the individual components that implement a circuit may change during training, the overall algorithm the circuit implements remains stable. This suggests that circuits identified for simple tasks may generalize beyond the single checkpoint at which they were found.
3. **Generalization Ability of Circuit Analysis**: Circuit analyses can carry over across training and across model scales, even as the components involved and the size of the circuit change. Findings from small models late in training can therefore sometimes be applied to larger models and to models trained for longer.

To achieve the above goals, the authors adopted the following methods:

- **Circuits**: Defined as the minimal computational subgraphs of the model that explain how it solves a specific task. Circuits can be found with various methods and then checked for whether they faithfully reflect the model's behavior.
- **Circuit Finding**: Circuits are identified with an efficient attribution-based method, Edge Attribution Patching with Integrated Gradients (EAP-IG), which scores the importance of every edge in the model's computational graph.
- **Model Selection**: The study examines models of different scales from the Pythia model suite, ranging from 70 million to 2.8 billion parameters, each trained on 300 billion tokens.
- **Task Selection**: Four simple tasks were chosen for analysis: Indirect Object Identification (IOI), Gendered-Pronoun, Greater-Than, and Subject-Verb Agreement.

Through this analysis, the authors found that models of different scales reach similar task performance after similar amounts of training, and that the key components supporting that performance also appear at similar points in training. Additionally, although the specific components that make up a circuit may change over time, the overall algorithm they implement remains relatively stable. These findings indicate that circuit analyses of small models at the end of pre-training can still inform work on larger models and on models that receive additional pre-training.
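The edge-scoring idea behind EAP-IG can be illustrated on a toy graph. The sketch below is not the authors' implementation. It assumes a hypothetical one-layer graph in which each upstream node is a linear map of the input embedding and a single downstream node sums their outputs. It also assumes the commonly stated form of the score: the corrupted-minus-clean activation of the upstream node, multiplied by the gradient of the task metric at the downstream node's input, averaged over `m` inputs interpolated between the corrupted and clean embeddings. All names (`eap_ig_scores`, `metric`, `downstream_input`, `W`, `w`) are illustrative, and the gradient is written out analytically where a real model would use automatic differentiation.

```python
import math
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def downstream_input(W, e):
    """Input to the downstream node: the sum of all upstream node outputs."""
    outputs = [matvec(W_i, e) for W_i in W]
    return [sum(column) for column in zip(*outputs)]


def metric(v_in, w):
    """Toy task metric (an assumption): a weighted sum of tanh activations."""
    return sum(w_j * math.tanh(v_j) for w_j, v_j in zip(w, v_in))


def metric_grad(v_in, w):
    """Analytic gradient of the toy metric w.r.t. the downstream input."""
    return [w_j * (1.0 - math.tanh(v_j) ** 2) for w_j, v_j in zip(w, v_in)]


def eap_ig_scores(W, e_clean, e_corrupt, w, m=50):
    """Return one attribution score per edge (upstream node i -> downstream node)."""
    avg_grad = [0.0] * len(w)
    for k in range(1, m + 1):
        alpha = k / m
        # Interpolate from the corrupted embedding toward the clean one.
        e_k = [c + alpha * (x - c) for x, c in zip(e_clean, e_corrupt)]
        grad = metric_grad(downstream_input(W, e_k), w)
        avg_grad = [a + g / m for a, g in zip(avg_grad, grad)]

    scores = []
    for W_i in W:
        z_clean = matvec(W_i, e_clean)
        z_corrupt = matvec(W_i, e_corrupt)
        # (corrupted - clean) upstream activation, dotted with the averaged gradient.
        scores.append(sum((zc - z) * g for zc, z, g in zip(z_corrupt, z_clean, avg_grad)))
    return scores


# Hypothetical toy setup: 3 upstream nodes, model width 4, embedding width 5.
random.seed(0)
W = [[[random.gauss(0, 0.3) for _ in range(5)] for _ in range(4)] for _ in range(3)]
e_clean = [random.gauss(0, 1) for _ in range(5)]
e_corrupt = [random.gauss(0, 1) for _ in range(5)]
w = [random.gauss(0, 1) for _ in range(4)]
print(eap_ig_scores(W, e_clean, e_corrupt, w))
```

In this linear toy, the edge scores sum approximately to the change in the metric caused by swapping the clean input for the corrupted one, which makes a convenient sanity check. Edges with the largest absolute scores would be the candidates for inclusion in a circuit.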