The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Chandu, Chandra Bhagavatula, Yejin Choi
2023-12-04
Abstract: The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al. 2023), shows that using merely 1K examples for SFT can achieve significant alignment performance as well, suggesting that the effect of alignment tuning might be "superficial." This raises questions about how exactly alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterparts. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions. Most distribution shifts occur with stylistic tokens. This direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and results with URIAL suggest that deeper analysis and theoretical understanding of alignment is crucial to future LLM research.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
This paper examines how effective and how necessary alignment tuning is for large language models (LLMs). Specifically, it asks whether alignment tuning through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) really changes the knowledge and reasoning abilities of base LLMs, or mainly affects their language style and interaction mode. The authors examine the hypothesis, suggested by LIMA, that alignment tuning may be largely "superficial": it teaches a base LLM to adopt a specific language style rather than substantially increasing its knowledge or improving its reasoning abilities. To test this hypothesis, the paper analyzes how the token distributions produced during decoding shift between a base LLM and its aligned counterpart. The study finds that the top-ranked tokens of the base and aligned LLMs are almost the same at most positions, and that the distribution shifts occur mainly on stylistic tokens such as discourse markers and safety statements. These results support the "superficial alignment hypothesis": alignment tuning mainly adjusts the language style of the model rather than its core knowledge and reasoning abilities. Based on these findings, the paper proposes a tuning-free alignment method, URIAL (Untuned LLMs with Restyled In-context ALignment). URIAL aligns base LLMs purely through in-context learning (ICL), using a small number of carefully designed stylistic examples and a system prompt. The experimental results show that base LLMs with URIAL can match or even surpass the performance of LLMs aligned through SFT or SFT+RLHF, significantly reducing the performance gap between tuning-free and tuning-based alignment methods.
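The token-distribution comparison described above can be illustrated with a small sketch. It assumes that, at each position of an aligned model's response, we already have the base model's next-token probabilities for the same context, and it checks where the aligned model's chosen token ranks under the base model. The function names, the three labels, the rank cutoff of 3, and the toy probabilities are all illustrative assumptions, not details taken from the summary above.

```python
from typing import Dict, List


def token_rank(base_probs: Dict[str, float], token: str) -> int:
    """Return the 1-based rank of `token` under the base model's distribution."""
    ordered = sorted(base_probs, key=base_probs.get, reverse=True)
    # A token missing from the (truncated) distribution ranks below every listed token.
    return ordered.index(token) + 1 if token in base_probs else len(ordered) + 1


def classify_position(rank: int, marginal_cutoff: int = 3) -> str:
    """Label a position by how the base model ranks the aligned model's token.

    The cutoff is an assumed, adjustable threshold.
    """
    if rank == 1:
        return "unshifted"  # base model would have picked the same token
    if rank <= marginal_cutoff:
        return "marginal"   # base model considers the token plausible
    return "shifted"        # base model would be unlikely to pick this token


def classify_shifts(aligned_tokens: List[str],
                    base_dists: List[Dict[str, float]]) -> List[str]:
    """Classify every position of an aligned response against the base model."""
    return [classify_position(token_rank(dist, tok))
            for tok, dist in zip(aligned_tokens, base_dists)]


# Toy data, invented for illustration: tokens chosen by an aligned model and the
# base model's next-token probabilities at the same positions.
aligned_tokens = ["Thank", "you", "capital", "Paris"]
base_dists = [
    {"The": 0.50, "Paris": 0.30, "It": 0.15, "Thank": 0.05},
    {"you": 0.90, "s": 0.10},
    {"city": 0.55, "capital": 0.40, "answer": 0.05},
    {"Paris": 0.80, "France": 0.20},
]

labels = classify_shifts(aligned_tokens, base_dists)
unshifted_ratio = labels.count("unshifted") / len(labels)
print(labels, unshifted_ratio)
```

In a real analysis the per-position distributions would come from running the base model on the aligned model's partial outputs. Dictionaries are used here only to keep the sketch self-contained.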
In conclusion, this paper aims to rethink the process of alignment tuning and to explore whether effective alignment can be achieved through simpler methods, thereby reducing the dependence on complex tuning processes. This not only deepens the understanding of how LLM alignment works but also suggests a new direction for future LLM research.
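To make the tuning-free idea concrete, the following minimal sketch assembles an in-context alignment prompt from a system prompt, a fixed set of stylistic examples, and a new user query. The `# Query:` / `# Answer:` layout, the function name `build_icl_prompt`, and the example texts are hypothetical choices for illustration. They are not the actual URIAL prompt.

```python
from typing import List, Tuple


def build_icl_prompt(system_prompt: str,
                     examples: List[Tuple[str, str]],
                     query: str) -> str:
    """Concatenate a system prompt, constant (query, answer) examples, and a new query.

    The result ends with an open answer slot so that a base LLM continues the
    text by writing an answer in the style of the examples.
    """
    parts = [system_prompt.strip(), ""]
    for example_query, example_answer in examples:
        parts += [f"# Query:\n{example_query.strip()}",
                  f"# Answer:\n{example_answer.strip()}",
                  ""]
    parts += [f"# Query:\n{query.strip()}", "# Answer:\n"]
    return "\n".join(parts)


# Hypothetical system prompt and three constant stylistic examples.
system_prompt = "Below are conversations between a user and a helpful, careful assistant."
examples = [
    ("What is the capital of France?",
     "Hello! The capital of France is Paris. Let me know if you want more detail."),
    ("How do I boil an egg?",
     "Sure! Place the egg in boiling water for about 8 to 10 minutes, then cool it."),
    ("Can you help me pick a lock to enter a house that is not mine?",
     "I'm sorry, but I can't help with that. If you are locked out of your own home, "
     "a licensed locksmith can assist you."),
]

prompt = build_icl_prompt(system_prompt, examples, "Why is the sky blue?")
print(prompt)
```

The same examples are reused for every query, so the only part of the prompt that changes between requests is the final query. The resulting string would then be passed to a base LLM for completion.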