Abstract: Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. Traditional approaches often depend on meticulously designed prompts, high-quality examples, or additional reward models for in-context learning, supervised fine-tuning, or RLHF. Reinforcement learning (RL) offers a dynamic alternative that lets LLMs overcome these dependencies by engaging directly with task-specific environments. Nonetheless, it faces significant hurdles: 1) instability stemming from the exponentially vast action space that must be explored; 2) difficulty in assigning token-level credit from action-level reward signals, which creates a conflict between maximizing rewards and accurately modeling corpus data. In response to these challenges, we introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. At the heart of ETPO is our novel per-token soft Bellman update, designed to harmonize the RL process with the principles of language modeling. This methodology decomposes the Q-function update from a coarse action-level view into a more granular token-level one, backed by a theoretical proof of optimization consistency. Crucially, this decomposition makes the time complexity of action exploration linear. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks; the results underline ETPO's potential as a robust method for refining the interactive decision-making capabilities of language agents. For more detailed preliminary work describing our motivation for token-level decomposition and its application to PPO methods, please refer to arXiv:2405.15821 (https://arxiv.org/abs/2405.15821).
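
The contrast between action-level and token-level exploration described in the abstract can be illustrated with a minimal sketch. This is not the ETPO implementation, and the abstract does not state the update's exact form. The sketch assumes a generic maximum-entropy (soft Q-learning style) backup in which tokens inside an action receive no reward and no discount, and only the final token receives the environment reward. All function names, the temperature `beta`, and the discount `gamma` are hypothetical and chosen for illustration.

```python
import math
from typing import List, Sequence


def num_action_candidates(vocab_size: int, action_len: int) -> int:
    """Distinct token sequences of a fixed length: exponential in the length."""
    return vocab_size ** action_len


def num_token_evaluations(vocab_size: int, action_len: int) -> int:
    """Per-token Q-values touched when each position is handled separately:
    linear in the length."""
    return vocab_size * action_len


def soft_value(q_values: Sequence[float], beta: float) -> float:
    """Entropy-regularized value beta * log(sum(exp(q / beta))), computed stably.
    ASSUMPTION: generic max-entropy form, not taken from the source."""
    m = max(q_values)
    return m + beta * math.log(sum(math.exp((q - m) / beta) for q in q_values))


def per_token_targets(
    next_token_q: List[Sequence[float]],
    reward: float,
    next_state_q: Sequence[float],
    beta: float,
    gamma: float,
) -> List[float]:
    """Backup targets for the L tokens of one action (hypothetical sketch).

    next_token_q[j] holds Q-values over the vocabulary for the token that
    follows position j, for j = 0 .. L-2. Intermediate tokens bootstrap from
    the soft value of the next token; the last token receives the action-level
    reward plus the discounted soft value of the first token in the next state.
    """
    targets = [soft_value(q, beta) for q in next_token_q]
    targets.append(reward + gamma * soft_value(next_state_q, beta))
    return targets


if __name__ == "__main__":
    # Toy sizes: a 50-token vocabulary and actions of 4 tokens.
    print(num_action_candidates(50, 4), num_token_evaluations(50, 4))
    print(per_token_targets([[0.0, 0.0]], 1.0, [0.0, 0.0], beta=1.0, gamma=0.5))
```

In this toy setting the reward reaches earlier tokens only through the chained soft values, which is one way to picture how an action-level signal could be turned into token-level credit.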

Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement