Abstract: The alignment tuning process of large language models (LLMs) typically involves instruction learning through supervised fine-tuning (SFT) and preference tuning via reinforcement learning from human feedback (RLHF). A recent study, LIMA (Zhou et al. 2023), shows that using merely 1K examples for SFT can achieve significant alignment performance as well, suggesting that the effect of alignment tuning might be "superficial." This raises the question of how exactly alignment tuning transforms a base LLM. We analyze the effect of alignment tuning by examining the token distribution shift between base LLMs and their aligned counterparts. Our findings reveal that base LLMs and their alignment-tuned versions perform nearly identically in decoding on the majority of token positions. Most distribution shifts occur with stylistic tokens. This direct evidence strongly supports the Superficial Alignment Hypothesis suggested by LIMA. Based on these findings, we rethink the alignment of LLMs by posing the research question: how effectively can we align base LLMs without SFT or RLHF? To address this, we introduce a simple, tuning-free alignment method, URIAL. URIAL achieves effective alignment purely through in-context learning (ICL) with base LLMs, requiring as few as three constant stylistic examples and a system prompt. We conduct a fine-grained and interpretable evaluation on a diverse set of examples, named JUST-EVAL-INSTRUCT. Results demonstrate that base LLMs with URIAL can match or even surpass the performance of LLMs aligned with SFT or SFT+RLHF. We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting and ICL. Our findings on the superficial nature of alignment tuning and our results with URIAL suggest that deeper analysis and theoretical understanding of alignment are crucial to future LLM research.
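As a rough illustration of the token distribution shift analysis described above, the sketch below ranks each token chosen by the aligned model under the base model's next-token distribution at the same position, then buckets positions by that rank. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the three category labels, the `marginal_max_rank=3` threshold, and the toy distributions are all hypothetical, and it assumes both models are scored on the same prefix.

```python
from typing import Dict, List

def base_rank(base_probs: Dict[str, float], aligned_token: str) -> int:
    """1-based rank of the aligned model's token under the base distribution.

    Tokens missing from base_probs are ranked just past the last known token.
    """
    ranked = sorted(base_probs, key=base_probs.get, reverse=True)
    if aligned_token not in base_probs:
        return len(ranked) + 1
    return ranked.index(aligned_token) + 1

def classify_position(rank: int, marginal_max_rank: int = 3) -> str:
    """Bucket a position by rank (the threshold is an illustrative assumption)."""
    if rank == 1:
        return "unshifted"   # base model would have decoded the same token
    if rank <= marginal_max_rank:
        return "marginal"    # aligned token is still a top candidate for the base model
    return "shifted"         # aligned token is unlikely under the base model

def shift_profile(base_dists: List[Dict[str, float]],
                  aligned_tokens: List[str]) -> Dict[str, float]:
    """Fraction of positions falling into each category."""
    counts = {"unshifted": 0, "marginal": 0, "shifted": 0}
    for dist, token in zip(base_dists, aligned_tokens):
        counts[classify_position(base_rank(dist, token))] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}

# Toy data (hypothetical): one base-model distribution per position.
toy_base_dists = [
    {"The": 0.5, "A": 0.3, "It": 0.15, "Thank": 0.05},
    {"capital": 0.9, "city": 0.1},
    {"of": 0.8, "in": 0.2},
    {"is": 0.6, "was": 0.4},
]
toy_aligned_tokens = ["Thank", "capital", "of", "was"]
profile = shift_profile(toy_base_dists, toy_aligned_tokens)
```

In this toy example only the opening token "Thank" lands in the "shifted" bucket, which mirrors the abstract's observation that most shifts occur with stylistic tokens. With real models, the per-position distributions would come from the base model's logits rather than hand-written dictionaries.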
