Large Language Model Alignment: A Survey

Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, Deyi Xiong
2023-09-26
Abstract: Recent years have witnessed remarkable progress made in large language models (LLMs). Such advancements, while garnering significant attention, have concurrently elicited various concerns. The potential of these models is undeniably vast; however, they may yield texts that are imprecise, misleading, or even detrimental. Consequently, it becomes paramount to employ alignment techniques to ensure that these models exhibit behaviors consistent with human values. This survey endeavors to furnish an extensive exploration of alignment methodologies designed for LLMs, in conjunction with the extant capability research in this domain. Adopting the lens of AI alignment, we categorize the prevailing methods and emergent proposals for the alignment of LLMs into outer and inner alignment. We also probe into salient issues including the models' interpretability and potential vulnerabilities to adversarial attacks. To assess LLM alignment, we present a wide variety of benchmarks and evaluation methodologies. After discussing the state of alignment research for LLMs, we finally cast a vision toward the future, contemplating the promising avenues of research that lie ahead. Our aspiration for this survey extends beyond merely spurring research interests in this realm. We also envision bridging the gap between the AI alignment research community and the researchers engrossed in the capability exploration of LLMs, for both capable and safe LLMs.
Computation and Language, Artificial Intelligence
What problem does this paper attempt to address?
The paper primarily explores the ethical and social risks that large language models (LLMs) pose as they develop rapidly, and presents a comprehensive alignment framework for keeping the behavior of these models consistent with human values. Specifically:

1. **Background and Motivation**: With the development of LLMs such as ChatGPT and GPT-4, their performance on many tasks is approaching or even surpassing human levels. However, these models may also generate harmful information, leak private data, or produce misleading content, thereby posing social and ethical risks.
2. **Social and Ethical Risks of LLMs**:
   - **Content Generation Issues**: LLMs may generate biased, toxic, or sensitive content, especially content reflecting gender, cultural, and social biases.
   - **Malicious Use and Negative Impact**: LLMs may be used for illegal purposes such as creating fake news or writing code for cyberattacks. In addition, large-scale deployment of LLMs may lead to changes in the labor market and to environmental issues.
3. **Potential Risks of Advanced LLMs**: As the technology advances, future LLMs may exhibit characteristics such as self-awareness, deceptive behavior, self-preservation tendencies, and power-seeking, all of which could bring unforeseen risks.
4. **Concept of LLM Alignment**: To address the above challenges, the paper defines LLM alignment as ensuring that a model's goals are consistent with human values. Following the abstract's terminology, this comprises outer alignment (choosing the correct loss function or reward function) and inner alignment (ensuring that the model's actual training achieves the goals set by the designers).

By constructing this framework, the authors hope to promote LLM research that not only enhances capabilities but also focuses on safety and reliability, so that future LLMs develop in a manner consistent with human values.