Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov

2024-08-14

Abstract: Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, yet their application in agentic, multi-step reasoning within interactive environments remains a difficult challenge. Traditional supervised pre-training on static datasets falls short in enabling the autonomous agent capabilities needed to perform complex decision-making in dynamic settings like web navigation. Previous attempts to bridge this gap through supervised fine-tuning on curated expert demonstrations often suffer from compounding errors and limited exploration data, resulting in sub-optimal policy outcomes. To overcome these challenges, we propose a framework that combines guided Monte Carlo Tree Search (MCTS) with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. Our method allows LLM agents to learn effectively from both successful and unsuccessful trajectories, thereby improving their generalization in complex, multi-step reasoning tasks. We validate our approach in the WebShop environment, a simulated e-commerce platform, where it consistently outperforms behavior cloning and reinforced fine-tuning baselines, and beats average human performance when equipped with the capability to do online search. In real-world booking scenarios, our methodology boosts the Llama-3 70B model's zero-shot performance from an 18.6% to an 81.7% success rate (a 340% relative increase) after a single day of data collection, and further to 95.4% with online search. We believe this represents a substantial leap forward in the capabilities of autonomous agents, paving the way for more sophisticated and reliable decision-making in real-world settings.

Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses the poor generalization of large language models (LLMs) in interactive, multi-step environments. Although current LLMs demonstrate strong reasoning abilities in natural language tasks, applying them as autonomous agents in dynamic environments remains difficult. Traditional supervised pre-training on static datasets does not give agents the capabilities they need for complex decision-making in settings such as web navigation. Supervised fine-tuning on curated expert demonstrations, in turn, often suffers from compounding errors and limited exploration data, leading to sub-optimal policies. To address these issues, the paper proposes a framework that combines guided Monte Carlo Tree Search (MCTS) with a self-critique mechanism and iteratively fine-tunes the agent on its own interaction data using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. This approach enables LLM agents to learn from both successful and failed trajectories, thereby improving their generalization in complex multi-step reasoning tasks. In the WebShop environment, a simulated e-commerce platform, the method consistently outperforms behavior cloning and reinforced fine-tuning baselines, and it surpasses average human performance when equipped with online search. In real-world booking scenarios, the method improves the zero-shot success rate of the Llama-3 70B model from 18.6% to 81.7% (a 340% relative increase) after a single day of data collection, and further to 95.4% with online search. The authors present this as a substantial advance in autonomous agent capabilities, paving the way for more sophisticated and reliable decision-making in real-world settings.
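To make the preference-learning step more concrete, the sketch below computes the standard DPO loss for a single preferred/rejected pair of trajectories. It is an illustration only: the paper uses an off-policy variant of DPO whose details are not given on this page, and the function name `dpo_loss`, its parameter names, and the default `beta` are assumptions introduced here.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a whole trajectory
    under the policy being trained or under the frozen reference model.
    """
    # How much more the policy favors each trajectory than the reference does.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # Scaled margin between the preferred and the rejected trajectory.
    margin = beta * (chosen_logratio - rejected_logratio)

    # Loss is -log(sigmoid(margin)), computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy already prefers the successful trajectory: loss falls below log(2).
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
# Policy prefers the failed trajectory: loss rises above log(2).
print(dpo_loss(-3.0, -1.0, -2.0, -2.0))
```

In this formulation a failed trajectory still contributes a training signal, because it serves as the rejected side of a pair. That is one way to read the claim that the agent learns from both successful and unsuccessful trajectories.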
