Revisiting the Superficial Alignment Hypothesis

Mohit Raghavendra, Vaskar Nath, Sean Hendryx
2024-09-28
Abstract: The Superficial Alignment Hypothesis posits that almost all of a language model's abilities and knowledge are learned during pre-training, while post-training is about giving a model the right style and format. We re-examine these claims by empirically studying the scaling behavior of post-training with increasing finetuning examples and evaluating them using objective task-specific standardized benchmarks. Through experiments with the Llama-3, Mistral, and Llama-2 model families of multiple sizes, we observe that, similar to the pre-training scaling laws, post-training task performance scales as a power law against the number of finetuning examples. This power law relationship holds across a broad array of capabilities, including mathematical reasoning, coding, instruction following, and multihop reasoning. In addition, for tasks like math and multihop reasoning, we observe that a handful of examples merely align the model stylistically but do not saturate performance on the benchmarks. Model performance is instead correlated with its reasoning ability and it improves significantly with more examples, illustrating the need for holistic evaluation programs leveraging objective benchmarks in addition to measurement of alignment to human preferences. We also observe that language models are not necessarily limited to using knowledge learned during pre-training. With appropriate post-training, a model's ability to integrate new knowledge greatly improves on downstream tasks like multihop question-answering. Taken together, these results shed new light on the Superficial Alignment Hypothesis, suggesting that it is, at best, an over-simplification.
Computation and Language, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper re-examines the "Superficial Alignment Hypothesis" and studies how the performance of large language models (LLMs) changes during fine-tuning and what drives the improvement. Specifically, the researchers focus on the following key issues:

1. **How does the performance of the fine-tuned model change with the size of the dataset?**
   - The experiments show a power-law relationship between the fine-tuned model's task performance and the amount of fine-tuning data:
     \[ P \propto D^{1/b} \]
     where \(P\) is the task performance, \(D\) is the number of fine-tuning examples, and \(b\) is a constant.
2. **Does the model significantly improve task-related abilities, or does it only learn the response style?**
   - Through analysis of tasks such as mathematical reasoning and multihop reasoning, the researchers found that fine-tuning improves not only format and style but also the model's reasoning and task-execution ability.
3. **Can the model integrate new knowledge beyond the pre-training knowledge cutoff date?**
   - Experiments show that, through appropriate fine-tuning or retrieval-augmented generation (RAG), the model can learn and use new knowledge, especially in multihop reasoning tasks.

### Main contributions of the paper

- **Re-evaluating the Superficial Alignment Hypothesis**: The research shows that the hypothesis is an over-simplification that ignores the improvement in the model's reasoning ability and its ability to integrate new knowledge during fine-tuning.
- **Proposing a more comprehensive evaluation method**: The paper emphasizes using objective, task-specific benchmarks to evaluate model performance rather than relying solely on subjective win-rate comparisons.
- **Demonstrating the substantial improvement of the model's ability by fine-tuning**: Experiments on multiple model families and tasks show that fine-tuning improves not only the model's style and format but also its reasoning and task-execution ability.
- **Exploring the learning and integration of new knowledge**: Fine-tuning is shown to help the model overcome the pre-training knowledge cutoff and make better use of new knowledge.

### Conclusion

This research shows that fine-tuning does more than adapt the model to a particular style or format: it can also significantly improve the model's reasoning and task-execution ability. Future fine-tuning work should therefore pay more attention to improving task-specific abilities rather than only superficial alignment. The research also points to methods for introducing new knowledge, such as further fine-tuning and retrieval-augmented generation, which matter for extending the knowledge boundaries of the model.
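
To make the power-law relationship between task performance \(P\) and the number of fine-tuning examples \(D\) concrete, the following is a minimal sketch of how the exponent could be estimated from measured points. It assumes the form \(P = a \cdot D^{1/b}\), where the prefactor `a`, the function name `fit_power_law`, and the log-log least-squares procedure are illustrative choices and not taken from the paper. The data points are synthetic, generated from a known power law, and are not results from the paper.

```python
import math


def fit_power_law(num_examples, performance):
    """Fit P = a * D**(1/b) by ordinary least squares in log-log space.

    Taking logs gives log P = log a + (1/b) * log D, which is a straight
    line with slope 1/b and intercept log a. Returns the pair (a, b).
    """
    xs = [math.log(d) for d in num_examples]
    ys = [math.log(p) for p in performance]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form simple linear regression on the log-transformed data.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), 1.0 / slope


# Synthetic, noise-free points from a known power law (hypothetical values,
# NOT measurements from the paper).
true_a, true_b = 0.05, 4.0
dataset_sizes = [100, 1_000, 10_000, 100_000]
scores = [true_a * d ** (1.0 / true_b) for d in dataset_sizes]

a_hat, b_hat = fit_power_law(dataset_sizes, scores)
print(f"a = {a_hat:.4f}, b = {b_hat:.2f}")
```

On real benchmark scores the points would be noisy, so the fit would only approximate the underlying exponent. A straight line on a log-log plot of score against dataset size is the usual visual check that a power law is a reasonable description.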