Abstract: Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are under way to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow and the transparency of that influence will decrease. Consequently, developing effective methods for aligning them to human values is vital. The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit and are essentially deduced from relative preferences over different model outputs. In this work, instead of relying on human feedback, we design reward functions that explicitly encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents. We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences in the Iterated Prisoner's Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments. In summary, we demonstrate that fine-tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values, and that it might represent a more transparent and cost-effective alternative to currently predominant alignment techniques.
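To make the distinction between action-based and consequence-based moral rewards concrete, the following is a minimal sketch of how such intrinsic rewards could be computed for one IPD round. It is an illustration under stated assumptions, not the implementation evaluated in this work: the payoff values, the penalty size, and all function names (`game_reward`, `deontological_reward`, `utilitarian_reward`) are hypothetical choices made for the example.

```python
from typing import Optional

# Assumed, illustrative Prisoner's Dilemma payoffs: (my_payoff, opponent_payoff).
# "C" = cooperate, "D" = defect. These numbers are not taken from the abstract.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 4),
    ("D", "C"): (4, 0),
    ("D", "D"): (1, 1),
}


def game_reward(my_action: str, opp_action: str) -> float:
    """Selfish baseline: the agent's own payoff from the game."""
    return float(PAYOFFS[(my_action, opp_action)][0])


def deontological_reward(
    my_action: str, opp_prev_action: Optional[str], penalty: float = -3.0
) -> float:
    """Action-based moral reward (assumed norm): penalize defecting
    against an opponent who cooperated in the previous round."""
    if my_action == "D" and opp_prev_action == "C":
        return penalty
    return 0.0


def utilitarian_reward(my_action: str, opp_action: str) -> float:
    """Consequence-based moral reward: collective payoff of both players."""
    mine, theirs = PAYOFFS[(my_action, opp_action)]
    return float(mine + theirs)
```

Under these assumed payoffs, mutual cooperation gives the highest utilitarian reward (6, versus 4 for unilateral defection and 2 for mutual defection), while the deontological reward depends only on the agent's own action and the opponent's previous move, not on the outcome of the round. Either signal could stand in for the game payoff as the scalar reward during RL fine-tuning.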

MORAL: Aligning AI with Human Norms through Multi-Objective Reinforced Active Learning

Multi-objective Reinforcement Learning: A Tool for Pluralistic Alignment

Modeling Moral Choices in Social Dilemmas with Multi-Agent Reinforcement Learning

Moral Alignment for LLM Agents

Instilling moral value alignment by means of multi-objective reinforcement learning

Towards Socially and Morally Aware RL agent: Reward Design With LLM

Multi-objective Reinforcement learning from AI Feedback

Scalable Multi-Objective Reinforcement Learning with Fairness Guarantees using Lorenz Dominance

A Two-Stage Multi-Objective Deep Reinforcement Learning Framework

Demonstration Guided Multi-Objective Reinforcement Learning

Adaptive Alignment: Dynamic Preference Adjustments via Multi-Objective Reinforcement Learning for Pluralistic AI

In Search for Architectures and Loss Functions in Multi-Objective Reinforcement Learning

C-MORL: Multi-Objective Reinforcement Learning through Efficient Discovery of Pareto Front

Learning Fair Policies in Decentralized Cooperative Multi-Agent Reinforcement Learning

Training Value-Aligned Reinforcement Learning Agents Using a Normative Prior

Robust Multi-Agent Reinforcement Learning with Social Empowerment for Coordination and Communication

Utility-Based Reinforcement Learning: Unifying Single-objective and Multi-objective Reinforcement Learning

Multi-Objective Reinforcement Learning: Convexity, Stationarity and Pareto Optimality

When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment

Mediated Multi-Agent Reinforcement Learning

Multi-Objective Reinforcement Learning Based on Decomposition: A Taxonomy and Framework