ProgressGym: Alignment with a Millennium of Moral Progress

Tianyi Qiu, Yang Zhang, Xuchuan Huang, Jasmine Xinze Li, Jiaming Ji, Yaodong Yang
2024-10-31
Abstract: Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at https://github.com/PKU-Alignment/ProgressGym and https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard respectively.
Machine Learning, Artificial Intelligence, Computation and Language, Computers and Society, Human-Computer Interaction
What problem does this paper attempt to address?
The paper asks **how to prevent frontier AI systems, such as large language models (LLMs), from reinforcing misguided moral beliefs in their interactions with humans and thereby contributing to societal value lock-in**. The authors argue that existing alignment methods, such as reinforcement learning from human feedback (RLHF), are vulnerable to the biases and moral blind spots present in contemporary human preference data, which may exacerbate this risk.

To address this, the paper proposes **progress alignment** as both a concept and a technical solution. Progress alignment aims to mitigate the risk of value lock-in by emulating the mechanics of human moral progress. To support research in this direction, the authors introduce **ProgressGym**, an experimental framework for learning the mechanics of moral progress from historical data and applying them to real-world moral decision-making challenges.

### Specific Problems and Solutions

1. **Problem Description**:
   - Frontier AI systems (such as LLMs) may inadvertently reinforce prevailing societal values, including problematic or misguided ones, in their interactions with humans.
   - This can lead to "value lock-in", in which problematic moral practices and social policies become entrenched for a long time and are difficult to change.
2. **Solution**:
   - **Progress Alignment**: A new alignment approach that aims to counter value lock-in by learning and emulating the mechanics of human moral progress.
   - **ProgressGym Framework**: An experimental framework that uses 9 centuries of historical text and 18 historical LLMs to turn real-world progress alignment challenges into concrete machine learning benchmarks.
   - **Core Challenges**:
     - **PG-Follow**: Track values that evolve over time.
     - **PG-Predict**: Anticipate future moral progress.
     - **PG-Coevolve**: Regulate the feedback loop between human and AI value shifts.
### Main Contributions

- **Theoretical Contribution**: The concept of progress alignment is proposed and formalized as a partially observable Markov decision process (POMDP), in which agents must learn about and adapt to changing human values.
- **Technical Contribution**: The ProgressGym framework is constructed, providing large-scale historical text data and historical LLMs, along with concrete implementations of the three core challenges.
- **Algorithmic Contribution**: Two classes of baseline algorithms for progress alignment, lifelong and extrapolative, are proposed, and their performance on the different tasks is demonstrated.

Through these efforts, the paper aims to promote the dynamic alignment of AI systems with human values, so that AI supports rather than hinders human moral progress over the long term.
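
The difference between the lifelong and extrapolative baselines can be illustrated with a deliberately simplified toy. The sketch below is not the paper's implementation: it assumes, purely for illustration, that the values of each period can be summarized as a single scalar score, and the function names `lifelong_estimate` and `extrapolative_estimate` are hypothetical. A lifelong-style method follows the latest observed values, while an extrapolative-style method fits a trend to past values and projects it forward.

```python
from typing import Sequence


def lifelong_estimate(history: Sequence[float]) -> float:
    """Toy 'lifelong' baseline: follow the most recently observed value."""
    return history[-1]


def extrapolative_estimate(history: Sequence[float], horizon: int = 1) -> float:
    """Toy 'extrapolative' baseline: fit a least-squares line to the
    history and project it `horizon` steps past the last observation."""
    n = len(history)
    if n < 2:
        # Not enough points to estimate a trend; fall back to tracking.
        return history[-1]
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + horizon)


# Hypothetical scalar "value scores" for four consecutive periods
# (made-up numbers, not data from the paper).
observed = [0.1, 0.2, 0.3, 0.4]

tracked = lifelong_estimate(observed)          # stays at the latest value
projected = extrapolative_estimate(observed)   # continues the linear trend
```

On a steadily moving trajectory like this one, the tracking estimate lags one step behind the next period, whereas the extrapolated estimate anticipates it. The actual baselines operate on language models and historical text rather than on scalar scores, so this only conveys the intuition.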