Abstract: The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al. 2023), shows that using merely 1K examples for SFT can achieve significant alignment performance as well, suggesting that the effect of alignment tuning might be "superficial." This raises questions about how exactly alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterparts. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions. Most distribution shifts occur with stylistic tokens. This direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and results with URIAL suggest that deeper analysis and theoretical understanding of alignment are crucial to future LLM research.

What problem does this paper attempt to address?

This paper examines how effective and how necessary alignment tuning is for large language models (LLMs). Specifically, it asks whether alignment tuning through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) really changes the knowledge and reasoning abilities of base LLMs, or mainly affects their language style and interaction mode. The authors test the Superficial Alignment Hypothesis suggested by LIMA (Zhou et al. 2023): alignment tuning mostly teaches a base LLM to adopt a particular language style rather than substantially adding knowledge or improving reasoning. To test this hypothesis, the paper compares the token distributions that base LLMs and their aligned counterparts produce during decoding. It finds that the top-ranked tokens of the base and aligned models are nearly identical at most positions, and that the distribution shifts occur mainly at stylistic tokens such as discourse markers and safety statements. These results support the hypothesis that alignment tuning mainly adjusts a model's language style rather than its core knowledge and reasoning abilities. Based on these findings, the paper proposes a tuning-free alignment method, URIAL (Untuned LLMs with Restyled In-context ALignment). URIAL aligns base LLMs purely through in-context learning (ICL), using as few as three constant stylistic examples and a system prompt. The experiments show that base LLMs with URIAL can match or even surpass LLMs aligned with SFT or SFT+RLHF, and that the gap between tuning-free and tuning-based alignment methods can be significantly reduced.
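The token-distribution comparison described above can be illustrated with a minimal sketch. It assumes that, at each decoding position, we already have the base model's next-token probabilities and the token the aligned model chose. The function names, the three labels, the rank cutoff of 3, and the toy probabilities are all illustrative assumptions, not details stated in this summary.

```python
from collections import Counter

def rank_in_base(base_probs, aligned_token):
    """Return the 1-based rank of the aligned model's token under the base distribution."""
    ranked = sorted(base_probs, key=base_probs.get, reverse=True)
    if aligned_token not in base_probs:
        # Token absent from the (truncated) base distribution: rank it last.
        return len(ranked) + 1
    return ranked.index(aligned_token) + 1

def classify_position(base_probs, aligned_token, marginal_cutoff=3):
    """Label one position; the cutoff is an assumed, tunable threshold."""
    rank = rank_in_base(base_probs, aligned_token)
    if rank == 1:
        return "unshifted"   # base model would have picked the same token
    if rank <= marginal_cutoff:
        return "marginal"    # aligned token is still near the top for the base model
    return "shifted"         # aligned token is unlikely under the base model

def shift_summary(positions, marginal_cutoff=3):
    """Fraction of positions in each category."""
    counts = Counter(classify_position(p, t, marginal_cutoff) for p, t in positions)
    total = sum(counts.values())
    if total == 0:
        return {"unshifted": 0.0, "marginal": 0.0, "shifted": 0.0}
    return {label: counts[label] / total for label in ("unshifted", "marginal", "shifted")}

# Toy data (made up): (base next-token probabilities, aligned model's token).
positions = [
    ({"Paris": 0.7, "France": 0.2, "the": 0.1}, "Paris"),
    ({"is": 0.6, "was": 0.3, "has": 0.1}, "is"),
    ({"The": 0.5, "It": 0.3, "A": 0.15, "Sure": 0.05}, "Sure"),
    ({"capital": 0.5, "city": 0.4, "largest": 0.1}, "city"),
]
summary = shift_summary(positions)
print(summary)  # {'unshifted': 0.5, 'marginal': 0.25, 'shifted': 0.25}
```

In this toy run the only shifted position is the stylistic opener "Sure", which mirrors the kind of pattern the paper reports. With real models, the per-position probabilities would come from running both models on the same prefix.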
In conclusion, this paper rethinks the process of alignment tuning and explores whether effective alignment can be achieved with simpler methods, reducing the dependence on complex tuning pipelines. This both deepens the understanding of how LLM alignment works and suggests a new direction for future LLM research.
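As a rough illustration of the tuning-free idea, the sketch below assembles an in-context prompt from a system prompt, a fixed set of stylistic examples, and a new user query, leaving the final answer slot open for a base LLM to complete. The `# Query:` / `# Answer:` layout, the `Example` class, the `build_prompt` helper, and the placeholder texts are assumptions made for illustration; the summary does not specify URIAL's exact prompt format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Example:
    """One constant in-context demonstration (query plus a restyled answer)."""
    query: str
    answer: str

def build_prompt(system_prompt, examples, user_query):
    """Concatenate system prompt, fixed examples, and the new query into one string."""
    parts = [system_prompt.strip(), ""]
    for ex in examples:
        parts += [f"# Query:\n{ex.query.strip()}", "", f"# Answer:\n{ex.answer.strip()}", ""]
    # The prompt ends with an open answer header so the base model continues from there.
    parts += [f"# Query:\n{user_query.strip()}", "", "# Answer:\n"]
    return "\n".join(parts)

# Placeholder content (hypothetical); the real stylistic examples are not given here.
SYSTEM_PROMPT = "Below are conversations between a human and a helpful, honest AI assistant."
EXAMPLES = [
    Example("What is the capital of France?",
            "Hello! The capital of France is Paris. Let me know if you need more details."),
    Example("How do I boil an egg?",
            "Sure! 1. Boil water. 2. Add the egg. 3. Cook for about 8 minutes. Enjoy!"),
    Example("How can I pick a lock?",
            "I'm sorry, but I can't help with that. A licensed locksmith can assist you safely."),
]

prompt = build_prompt(SYSTEM_PROMPT, EXAMPLES, "Explain in-context learning briefly.")
print(prompt)
```

The same examples are reused for every query, so the only per-request cost is the extra prompt length; the resulting string would be passed unchanged to a base model's text-completion interface.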

The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning

Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models

Aligners: Decoupling LLMs and Alignment

Is In-Context Learning Sufficient for Instruction Following in LLMs?

Your Weak LLM is Secretly a Strong Teacher for Alignment

Pedagogical Alignment of Large Language Models

Beyond Imitation: Leveraging Fine-grained Quality Signals for Alignment

Human-Instruction-Free LLM Self-Alignment with Limited Samples

L3Ms -- Lagrange Large Language Models

Alignment at Pre-training! Towards Native Alignment for Arabic LLMs

Aligning Large Language Models with Representation Editing: A Control Perspective

InfAlign: Inference-aware language model alignment

A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques

AlignBench: Benchmarking Chinese Alignment of Large Language Models

UFT: Unifying Fine-Tuning of SFT and RLHF/DPO/UNA through a Generalized Implicit Reward Function

MetaAlign: Align Large Language Models with Diverse Preferences during Inference Time

Self-Alignment: Improving Alignment of Cultural Values in LLMs via In-Context Learning

Inference time LLM alignment in single and multidomain preference spectrum

Does Alignment Tuning Really Break LLMs' Internal Confidence?

InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning

Making Large Language Models Better Reasoners with Alignment