Abstract: Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. Traditional approaches often depend on meticulously designed prompts, high-quality examples, or additional reward models for in-context learning, supervised fine-tuning, or RLHF. Reinforcement learning (RL) presents a dynamic alternative for LLMs to overcome these dependencies by engaging directly with task-specific environments. Nonetheless, it faces significant hurdles: 1) instability stemming from the exponentially vast action space that must be explored; 2) challenges in assigning token-level credit based on action-level reward signals, resulting in discord between maximizing rewards and accurately modeling corpus data. In response to these challenges, we introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. At the heart of ETPO is our novel per-token soft Bellman update, designed to harmonize the RL process with the principles of language modeling. This methodology decomposes the Q-function update from a coarse action-level view to a more granular token-level perspective, backed by theoretical proof of optimization consistency. Crucially, this decomposition yields linear time complexity in action exploration. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks; results underline ETPO's potential as a robust method for refining the interactive decision-making capabilities of language agents. For more detailed preliminary work describing our motivation for token-level decomposition and its application to PPO methods, please refer to arXiv:2405.15821 (https://arxiv.org/abs/2405.15821).
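To make the token-level idea in the abstract concrete, the following is a minimal sketch of a generic entropy-regularized (soft) backup applied per token. It is an illustration under stated assumptions, not the paper's exact update: the function names (`soft_value`, `token_level_targets`, `candidates_to_explore`), the temperature `beta`, the discount `gamma`, and the use of a plain log-sum-exp soft value (a uniform prior rather than any reference policy) are all hypothetical choices made here. The assumed scheme is that tokens inside an action back up from the soft value of the next token position, and only the final token of the action receives the environment reward.

```python
import math


def soft_value(q_values, beta):
    """Soft (entropy-regularized) value: beta * log(sum(exp(Q / beta))).

    The maximum is subtracted first for numerical stability.
    """
    m = max(q_values)
    return m + beta * math.log(sum(math.exp((q - m) / beta) for q in q_values))


def token_level_targets(intra_action_q, next_state_q, reward, beta=1.0, gamma=0.99):
    """Backup targets for every token of one multi-token action (assumed scheme).

    intra_action_q: one list of Q-values over the vocabulary for each token
        position 2..L of the action (L - 1 lists in total).
    next_state_q: Q-values over the vocabulary for the first token chosen in
        the next environment state.
    """
    # Tokens 1..L-1: no reward and no discount, just the soft value of the
    # next token position within the same action.
    targets = [soft_value(q, beta) for q in intra_action_q]
    # Last token: action-level reward plus the discounted soft value of the
    # first token position in the next state.
    targets.append(reward + gamma * soft_value(next_state_q, beta))
    return targets


def candidates_to_explore(vocab_size, action_length):
    """Number of candidates at the action level versus the token level."""
    return vocab_size ** action_length, vocab_size * action_length


if __name__ == "__main__":
    print(token_level_targets([[0.0, 0.0]], [1.0, 1.0], reward=2.0, beta=1.0, gamma=0.5))
    print(candidates_to_explore(vocab_size=10, action_length=3))
```

The last helper illustrates the complexity claim: enumerating whole actions of length L over a vocabulary of size |V| touches |V|^L candidates, while scoring one token position at a time touches |V| * L.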
