Abstract: Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at https://github.com/PKU-Alignment/ProgressGym and https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard, respectively.

What problem does this paper attempt to address?

The main problem this paper addresses is **how to prevent frontier AI systems (such as large language models, LLMs) from reinforcing misguided moral beliefs in their interactions with humans and thereby contributing to societal value lock-in**. Specifically, the authors point out that existing alignment methods (such as reinforcement learning from human feedback, RLHF) are vulnerable to the biases and moral blind spots present in contemporary human preference data, which may exacerbate this risk.

To address this issue, the paper proposes the concept and technical approach of **progress alignment**, which aims to mitigate the risk of value lock-in by emulating the mechanics of human moral progress. To this end, the authors introduce **ProgressGym**, an experimental framework for learning the mechanics of moral progress from historical data and applying them to real-world moral decision-making challenges.

### Specific Problems and Solutions

1. **Problem Description**:
   - Frontier AI systems (such as LLMs) may inadvertently reinforce prevailing societal values, including misguided or problematic ones, in their interactions with humans.
   - This may lead to "value lock-in", in which problematic moral practices and social policies become entrenched for a long time and are difficult to change.
2. **Solution**:
   - **Progress Alignment**: A new alignment approach that aims to counter value lock-in by learning and emulating the mechanics of human moral progress.
   - **ProgressGym Framework**: An experimental framework that uses 9 centuries of historical text and 18 historical LLMs to turn real-world progress alignment challenges into concrete machine learning benchmarks.
   - **Core Challenges**:
     - **PG-Follow**: Track values as they evolve over time.
     - **PG-Predict**: Anticipate future moral progress.
     - **PG-Coevolve**: Regulate the feedback loop between human and AI value shifts.

### Main Contributions

- **Theoretical Contribution**: The concept of progress alignment is proposed and formalized as a partially observable Markov decision process (POMDP), in which agents need to learn and adapt to changing human values.
- **Technical Contribution**: The ProgressGym framework is constructed, providing large-scale historical text data and historical LLMs, as well as concrete implementations of the three core challenges.
- **Algorithmic Contribution**: Two families of baseline algorithms, lifelong and extrapolative, are proposed for progress alignment, and their performance on the different tasks is demonstrated.

Through these efforts, the paper aims to promote the dynamic alignment of AI systems with human values, so that AI promotes rather than hinders human moral progress over the long term.
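The difference between the lifelong and extrapolative baselines can be illustrated with a toy numeric analogy. The sketch below is not the paper's implementation (the actual baselines operate on LLMs and historical text). It assumes, purely for illustration, that the values of each time period can be summarized as a small vector of numbers. The function names `lifelong_estimate` and `extrapolative_estimate`, and the trajectory data, are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def lifelong_estimate(history: Sequence[Vector]) -> Vector:
    """Track-the-present baseline: return the most recently observed value vector."""
    return list(history[-1])


def extrapolative_estimate(history: Sequence[Vector], steps_ahead: int = 1) -> Vector:
    """Fit a least-squares line per dimension over time and project it forward."""
    n = len(history)
    if n < 2:
        # A single observation has no trend, so fall back to the latest values.
        return list(history[-1])
    times = list(range(n))
    t_mean = sum(times) / n
    denom = sum((t - t_mean) ** 2 for t in times)
    target_t = (n - 1) + steps_ahead
    estimate = []
    for d in range(len(history[0])):
        ys = [v[d] for v in history]
        y_mean = sum(ys) / n
        slope = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, ys)) / denom
        estimate.append(y_mean + slope * (target_t - t_mean))
    return estimate


def squared_error(a: Vector, b: Vector) -> float:
    """Sum of squared differences between two value vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical trajectory: one value dimension rises steadily, the other stays flat.
trajectory = [[0.0, 0.5], [0.1, 0.5], [0.2, 0.5], [0.3, 0.5], [0.4, 0.5]]
observed, future = trajectory[:-1], trajectory[-1]

follow = lifelong_estimate(observed)                      # mirrors the latest period
predict = extrapolative_estimate(observed, steps_ahead=1)  # projects the trend forward

print("lifelong error:     ", squared_error(follow, future))
print("extrapolative error:", squared_error(predict, future))
```

On this made-up trajectory, the extrapolative estimate lands closer to the held-out future period because the trend is linear by construction. Real value shifts need not be linear, so this only illustrates why a temporal dimension matters for tasks such as PG-Predict.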

ProgressGym: Alignment with a Millennium of Moral Progress

Moral Alignment for LLM Agents

Self-Alignment of Large Language Models via Monopolylogue-based Social Scene Simulation

Language Model Alignment in Multilingual Trolley Problems

From Instructions to Intrinsic Human Values -- A Survey of Alignment Goals for Big Models

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

A Moral Imperative: The Need for Continual Superalignment of Large Language Models

Agent Alignment in Evolving Social Norms

FairMindSim: Alignment of Behavior, Emotion, and Belief in Humans and LLM Agents Amid Ethical Dilemmas

On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models

Strong and weak alignment of large language models with human values

Denevil: Towards Deciphering and Navigating the Ethical Values of Large Language Models via Instruction Learning

Dynamic Normativity: Necessary and Sufficient Conditions for Value Alignment

The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making

ValueCompass: A Framework of Fundamental Values for Human-AI Alignment

AI Alignment: A Comprehensive Survey

Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing

The Alignment Problem from a Deep Learning Perspective

Aligner: Efficient Alignment by Learning to Correct

ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation