Abstract: Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. Traditional approaches often depend on meticulously designed prompts, high-quality examples, or additional reward models for in-context learning, supervised fine-tuning, or RLHF. Reinforcement learning (RL) presents a dynamic alternative for LLMs to overcome these dependencies by engaging directly with task-specific environments. Nonetheless, it faces significant hurdles: 1) instability stemming from the exponentially vast action space requiring exploration; 2) challenges in assigning token-level credit based on action-level reward signals, resulting in discord between maximizing rewards and accurately modeling corpus data. In response to these challenges, we introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. At the heart of ETPO is our novel per-token soft Bellman update, designed to harmonize the RL process with the principles of language modeling. This methodology decomposes the Q-function update from a coarse action-level view to a more granular token-level perspective, backed by theoretical proof of optimization consistency. Crucially, this decomposition yields linear time complexity in action exploration. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks; the results underline ETPO's potential as a robust method for refining the interactive decision-making capabilities of language agents. For more detailed preliminary work that describes our motivation for token-level decomposition and applies it to PPO methods, please refer to <a class="link-https" data-arxiv-id="2405.15821" href="https://arxiv.org/abs/2405.15821">arXiv:2405.15821</a>.
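
To make the complexity claim in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes an action is a sequence of `action_len` tokens drawn from a vocabulary of `vocab_size` tokens, and compares how many candidates must be considered when whole actions are enumerated against when each token position is scored separately. The function names are hypothetical. The `soft_value` helper shows the generic log-sum-exp soft value used in entropy-regularized RL; the abstract does not state the exact form of ETPO's per-token soft Bellman update, so this should be read only as an illustration of the general idea.

```python
import math


def action_level_candidates(vocab_size: int, action_len: int) -> int:
    """Number of distinct whole actions: one choice per position, multiplied."""
    return vocab_size ** action_len


def token_level_candidates(vocab_size: int, action_len: int) -> int:
    """Number of token scores if each position is evaluated separately."""
    return vocab_size * action_len


def soft_value(q_values, alpha: float = 1.0) -> float:
    """Generic entropy-regularized soft value: alpha * log(sum(exp(q / alpha))).

    Assumption: this is the standard soft value from maximum-entropy RL,
    not necessarily the exact per-token update used by ETPO.
    """
    scaled = [q / alpha for q in q_values]
    peak = max(scaled)  # subtract the maximum for numerical stability
    return alpha * (peak + math.log(sum(math.exp(s - peak) for s in scaled)))


if __name__ == "__main__":
    vocab_size, action_len = 10, 3
    print(action_level_candidates(vocab_size, action_len))  # exponential in length
    print(token_level_candidates(vocab_size, action_len))   # linear in length
    print(soft_value([1.0, 2.0, 3.0], alpha=0.5))
```

The gap between the two counts widens rapidly as the action gets longer, which is the motivation for working at the token level. The soft value is never smaller than the largest Q-value and approaches it as `alpha` shrinks.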

Scheduled Policy Optimization for Natural Language Communication with Intelligent Agents

Non-local Policy Optimization via Diversity-regularized Collaborative Exploration

Optimal Exploration Algorithm of Multi-Agent Reinforcement Learning Methods (Student Abstract)

From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning

Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement

Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization

Learning to Model the World with Language

Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization

Trajectory-Oriented Policy Optimization with Sparse Rewards

Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations

OMPO: A Unified Framework for RL under Policy and Dynamics Shifts

Scalable Model-based Policy Optimization for Decentralized Networked Systems

Scheduled Curiosity-Deep Dyna-Q: Efficient Exploration for Dialog Policy Learning

EPO: Hierarchical LLM Agents with Environment Preference Optimization

Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue Policy Learning

Embodied Executable Policy Learning with Language-based Scene Summarization

Training Language Model Agents without Modifying Language Models

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Multi-Agent Task-Oriented Dialog Policy Learning with Role-Aware Reward Decomposition