Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, Rafael Rafailov
2024-08-14
Abstract: Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, yet their application in agentic, multi-step reasoning within interactive environments remains a difficult challenge. Traditional supervised pre-training on static datasets falls short in enabling the autonomous agent capabilities needed to perform complex decision-making in dynamic settings like web navigation. Previous attempts to bridge this gap through supervised fine-tuning on curated expert demonstrations often suffer from compounding errors and limited exploration data, resulting in sub-optimal policy outcomes. To overcome these challenges, we propose a framework that combines guided Monte Carlo Tree Search (MCTS) with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. Our method allows LLM agents to learn effectively from both successful and unsuccessful trajectories, thereby improving their generalization in complex, multi-step reasoning tasks. We validate our approach in the WebShop environment, a simulated e-commerce platform, where it consistently outperforms behavior cloning and reinforced fine-tuning baselines, and beats average human performance when equipped with the capability to do online search. In real-world booking scenarios, our methodology boosts the Llama-3 70B model's zero-shot performance from an 18.6% to an 81.7% success rate (a 340% relative increase) after a single day of data collection, and further to 95.4% with online search. We believe this represents a substantial leap forward in the capabilities of autonomous agents, paving the way for more sophisticated and reliable decision-making in real-world settings.
Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the limited ability of large language models (LLMs) to generalize when acting as autonomous agents in interactive, multi-step environments. Although current LLMs reason well on natural language tasks, applying them as agents in dynamic environments remains difficult. Supervised pre-training on static datasets does not by itself equip an agent for complex decision-making settings such as web navigation. Supervised fine-tuning on curated expert demonstrations also tends to suffer from compounding errors and limited exploration data, which leads to suboptimal policies. To address these issues, the paper proposes a framework that combines guided Monte Carlo Tree Search (MCTS) with a self-critique mechanism and iteratively fine-tunes the agent on its own interaction data using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. This lets LLM agents learn from both successful and failed trajectories, improving their generalization on complex multi-step reasoning tasks. In the WebShop environment, a simulated e-commerce platform, the method consistently outperforms behavior cloning and reinforced fine-tuning baselines, and it beats average human performance when equipped with online search. In real-world booking scenarios, it raises the zero-shot success rate of the Llama-3 70B model from 18.6% to 81.7% (a 340% relative increase) after a single day of data collection, and further to 95.4% with online search. The authors present this as a substantial advance in autonomous agent capabilities that paves the way for more sophisticated and reliable decision-making in real-world settings.
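To make the preference-learning step more concrete, here is a minimal sketch of the standard DPO loss for a single preferred/rejected pair, such as a successful versus a failed action or trajectory. It illustrates the general DPO objective only. The text above does not specify the paper's off-policy variant or how preference pairs are built from the MCTS tree, so the function name, argument names, and the `beta` default are all assumptions made for illustration.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair (hypothetical helper).

    Each argument is the summed log-probability of a response under
    either the policy being trained or the frozen reference model.
    beta (an assumed default) scales how strongly the policy is pushed
    away from the reference.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Usage: the policy already favors the chosen response more than the
# reference does, so the loss is below log(2), the value at zero margin.
loss = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(loss < math.log(2))
```

The loss equals log(2) when the policy and the reference agree, falls as the policy shifts probability toward the preferred response, and rises when it shifts toward the rejected one. This is how a failed trajectory can contribute a training signal instead of being discarded.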