Abstract: A broad use case of large language models (LLMs) is in goal-directed decision-making tasks (or "agent" tasks), where an LLM needs to not just generate completions for a given prompt, but rather make intelligent decisions over a multi-turn interaction to accomplish a task (e.g., when interacting with the web, using tools, or providing customer support). Reinforcement learning (RL) provides a general paradigm to address such agent tasks, but current RL methods for LLMs largely focus on optimizing single-turn rewards. By construction, most single-turn RL methods cannot endow LLMs with the ability to intelligently seek information over multiple turns, perform credit assignment, or reason about their past actions -- all of which are critical in agent tasks. This raises the question: how can we design effective and efficient multi-turn RL algorithms for LLMs? In this paper, we develop a framework for building multi-turn RL algorithms for fine-tuning LLMs that preserves the flexibility of existing single-turn RL methods for LLMs (e.g., proximal policy optimization), while accommodating multiple turns, long horizons, and delayed rewards effectively. To do this, our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel: a high-level off-policy value-based RL algorithm to aggregate reward over utterances, and a low-level RL algorithm that utilizes this high-level value function to train a token policy within each utterance or turn. Our hierarchical framework, Actor-Critic Framework with a Hierarchical Structure (ArCHer), can also give rise to other RL methods. Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks, attaining about 100x better sample efficiency than existing methods, while also improving with larger model capacity (up to the 7 billion scale that we tested on).
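
The two-level structure described in the abstract can be illustrated with a small tabular sketch. This is not the authors' implementation: the `Turn` record, the function names, the tabular `q`/`v` tables, and the choice of a TD(0) update with a shared per-token advantage are all assumptions made for illustration. The high level learns utterance values off-policy from a replay buffer; the low level reuses those values as a training signal for the tokens inside each utterance.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Turn:
    """One utterance-level transition (hypothetical record, for illustration)."""
    state: str                   # interaction history before the utterance
    utterance: Tuple[str, ...]   # tokens emitted in this turn
    reward: float                # reward observed after the full utterance
    next_state: str
    done: bool


def update_utterance_critic(q, v, replay, gamma=0.9, lr=0.5):
    """High level: off-policy TD(0) over whole utterances from a replay buffer."""
    for t in replay:
        # Bootstrap from the next state's value unless the episode ended.
        target = t.reward + (0.0 if t.done else gamma * v[t.next_state])
        key = (t.state, t.utterance)
        q[key] += lr * (target - q[key])
        # V is regressed toward Q for the utterances observed in this state.
        v[t.state] += lr * (q[key] - v[t.state])


def token_advantages(q, v, turn):
    """Low level: each token in the utterance shares the utterance advantage."""
    adv = q[(turn.state, turn.utterance)] - v[turn.state]
    return [adv] * len(turn.utterance)


q, v = defaultdict(float), defaultdict(float)
turn = Turn("s0", ("buy", "milk"), reward=1.0, next_state="s1", done=True)
update_utterance_critic(q, v, [turn])
advantages = token_advantages(q, v, turn)
```

In a full system, values like `advantages` would weight the token log-probability gradients of the language model in a policy-gradient update, and the tables would be replaced by learned critics; both of those steps are outside this sketch.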

Guiding Multi-agent Multi-task Reinforcement Learning by a Hierarchical Framework with Logical Reward Shaping

Logic-based Reward Shaping for Multi-Agent Reinforcement Learning

Subgoal-based Hierarchical Reinforcement Learning for Multi-Agent Collaboration

Hierarchical Deep Multiagent Reinforcement Learning with Temporal Abstraction

Leveraging Organizational Hierarchy to Simplify Reward Design in Cooperative Multi-agent Reinforcement Learning

Hierarchical Multi-Agent Reinforcement Learning for Cooperative Tasks with Sparse Rewards in Continuous Domain

ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL

Hierarchical Method for Cooperative Multiagent Reinforcement Learning in Markov Decision Processes

Hierarchical Reinforcement Learning Based Multi-Agent Collaborative Control Approach

Hierarchical Consensus-Based Multi-Agent Reinforcement Learning for Multi-Robot Cooperation Tasks

Retrieval-Augmented Hierarchical in-Context Reinforcement Learning and Hindsight Modular Reflections for Task Planning with LLMs

Hierarchical reinforcement learning for handling sparse rewards in multi-goal navigation

A Framework for Following Temporal Logic Instructions with Unknown Causal Dependencies

A Hierarchical Framework for Cooperative Tasks in Multi-agent Systems

Reinforcement Learning with Task Decomposition for Cooperative Multiagent Systems

LMRL: a Multi-Agent Reinforcement Learning Model and Algorithm

Goal-Conditioned Hierarchical Reinforcement Learning with High-Level Model Approximation

LLM Augmented Hierarchical Agents

Hierarchical Task Network Planning for Facilitating Cooperative Multi-Agent Reinforcement Learning

Hierarchical Reinforcement Learning with Opponent Modeling for Distributed Multi-agent Cooperation

Nl2Hltl2Plan: Scaling Up Natural Language Understanding for Multi-Robots Through Hierarchical Temporal Logic Task Representation