Abstract: Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category is further divided into several sub-categories, resulting in a total of 29 sub-categories. Additionally, a subset of 8 sub-categories is selected for further investigation, where corresponding measurement studies are designed and conducted on several widely-used LLMs. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. This highlights the importance of conducting more fine-grained analyses, testing, and making continuous improvements on LLM alignment. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications.

The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models

Large Language Model Alignment: A Survey

The Problem of Alignment

Alignment is not sufficient to prevent large language models from generating harmful information: A psychoanalytic perspective

Understanding the Learning Dynamics of Alignment with Human Feedback

One fish, two fish, but not the whole sea: Alignment reduces language models' conceptual diversity

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Cultural Alignment in Large Language Models: An Explanatory Analysis Based on Hofstede's Cultural Dimensions

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Aligning with Logic: Measuring, Evaluating and Improving Logical Consistency in Large Language Models

ABC Align: Large Language Model Alignment for Safety & Accuracy

A Moral Imperative: The Need for Continual Superalignment of Large Language Models

From Distributional to Overton Pluralism: Investigating Large Language Model Alignment

Unintended Impacts of LLM Alignment on Global Representation

LLM Theory of Mind and Alignment: Opportunities and Risks

From Instructions to Intrinsic Human Values -- A Survey of Alignment Goals for Big Models

Toward Cultural Interpretability: A Linguistic Anthropological Framework for Describing and Evaluating Large Language Models (LLMs)

The benefits, risks and bounds of personalizing the alignment of large language models to individuals

Improving and Assessing the Fidelity of Large Language Models Alignment to Online Communities

Aligning Large Language Models with Human: A Survey

Your Weak LLM is Secretly a Strong Teacher for Alignment