Abstract: Large language models (LLMs) demonstrate impressive reasoning abilities, but translating reasoning into actions in the real world remains challenging. In particular, it remains unclear how to complete a given task provably within a minimum number of interactions with the external environment, e.g., through an internal mechanism of reasoning. To this end, we propose a principled framework with provable regret guarantees to orchestrate reasoning and acting, which we call "reason for future, act for now" (\texttt{RAFA}). Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon ("reason for future"). At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state. The key idea is to cast reasoning in LLMs as learning and planning in Bayesian adaptive Markov decision processes (MDPs). Correspondingly, we prompt LLMs to form an updated posterior of the unknown environment from the memory buffer (learning) and generate an optimal trajectory for multiple future steps that maximizes a value function (planning). The learning and planning subroutines are performed in an "in-context" manner to emulate the actor-critic update for MDPs. Our theoretical analysis proves that the novel combination of long-term reasoning and short-term acting achieves a $\sqrt{T}$ regret. Here, $T$ denotes the number of online interactions. In particular, the regret bound highlights an intriguing interplay between the prior knowledge obtained through pretraining and the uncertainty reduction achieved by reasoning and acting. Our empirical validation shows that it outperforms various existing frameworks and achieves nearly perfect scores on a few benchmarks.
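As a reading aid for the $\sqrt{T}$ claim, the block below writes out a generic textbook form of cumulative regret. This is a sketch under assumptions, not the paper's exact definition: the symbols $V^{\pi}$, $\pi^\star$, $\pi_t$, and $s_t$ are introduced here for illustration, and the paper's own definition may differ (for example, by taking a Bayesian expectation over the unknown environment or by including discounting and other problem-dependent factors).

```latex
% Assumed notation (not from the source):
%   \pi^\star : optimal policy,  \pi_t : policy the agent follows at step t,
%   s_t : state at step t,       V^{\pi}(s) : value of policy \pi from state s.
\[
  \mathrm{Regret}(T)
  \;=\; \sum_{t=1}^{T} \Bigl( V^{\pi^\star}(s_t) - V^{\pi_t}(s_t) \Bigr),
  \qquad
  \mathrm{Regret}(T) \;=\; \mathcal{O}\bigl(\sqrt{T}\bigr)
  \;\Longrightarrow\;
  \frac{\mathrm{Regret}(T)}{T} \;=\; \mathcal{O}\bigl(1/\sqrt{T}\bigr) \to 0 .
\]
```

Under this reading, a sublinear regret means the average per-step gap to the optimal policy vanishes as the number of online interactions $T$ grows.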

What problem does this paper attempt to address?

This paper asks how a large language model (LLM) can complete a given task with the minimum number of interactions with an external environment, in a way that is theoretically provable. Specifically, the authors focus on how to turn the reasoning ability of LLMs into actions when the number of interactions must be kept as small as possible.

### Background of the paper and problem description

1. **Reasoning ability of large language models (LLMs)**
   - LLMs have demonstrated excellent reasoning ability, but still face challenges when translating it into actions in the real world.
   - In particular, there is no clear method for ensuring that a task is completed with the minimum number of interactions.
2. **Limitations of reinforcement learning (RL)**
   - RL is a well-established framework for improving actions by collecting feedback, but there are conceptual gaps in applying existing RL techniques directly to LLMs.
   - For example, RL operates in a numerical system (rewards and transitions are defined by scalars and probabilities), while LLMs operate in a linguistic system (inputs and outputs are language tokens).
   - In addition, the LLM stays fixed during the interaction, whereas RL trains the actor and critic through parameter updates.
3. **Research objectives**
   - Design an internal reasoning mechanism that enables an LLM to complete tasks within the minimum number of interactions.
   - Provide a theoretically provable framework that ensures sample efficiency.

### Solution

To address these problems, the authors propose the "reason for future, act for now" (RAFA) framework, which has the following main features:

- **Bayesian adaptive Markov decision process (Bayesian adaptive MDP)**
  - Formalize reasoning and acting as a Bayesian adaptive MDP, where the hidden variable is the unknown environment.
  - Use a memory buffer to store states, actions, rewards, and their language summaries as the information state.
- **Reasoning and planning subroutines**
  - **Learning subroutine**: Estimate the external environment from the memory buffer by inferring the transition and reward model or the value function.
  - **Planning subroutine**: Generate an optimal policy or trajectory that maximizes the value function over multiple future steps.
- **Closed-loop mechanism**
  - At each step, the LLM plans a future trajectory from the current state ("reason for future") and executes the initial action of the plan ("act for now").
  - After collecting feedback, it reinvokes the reasoning subroutine and replans the future trajectory from the new state.

### Theoretical and empirical analysis

- **Theoretical analysis**
  - Prove that the RAFA framework achieves a $\sqrt{T}$ regret bound, where $T$ is the number of online interactions.
- **Experimental verification**
  - Evaluate RAFA on multiple tasks, including ALFWorld, BlocksWorld, Game of 24, and Tic-Tac-Toe, demonstrating its superior performance.

### Summary

The main contributions of this paper are as follows:

1. Establish the correspondence between LLMs and RL, and design a principled framework, RAFA, to coordinate reasoning and acting.
2. Show experimentally that RAFA outperforms existing frameworks on interactive decision-making tasks.
3. Prove the sample efficiency of RAFA theoretically, which helps explain its strong empirical performance.

In this way, RAFA addresses the problem of turning LLM reasoning into actions in real-world applications and backs it with a theoretical guarantee of sample efficiency, in the form of the $\sqrt{T}$ regret bound.
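The closed loop described above (learn from the memory buffer, plan several steps ahead, execute only the first action, store the feedback, replan) can be sketched on a toy problem. This is a minimal illustration under loud assumptions, not the authors' implementation: no LLM is called, and `LineWorld`, `learn`, `plan`, and `rafa` are hypothetical names introduced here. The `learn` function stands in for the LLM prompt that forms a posterior over the unknown environment, and `plan` stands in for the prompt that generates a multi-step trajectory.

```python
from dataclasses import dataclass


@dataclass
class LineWorld:
    """Hypothetical toy environment: reach a hidden goal cell on a line.

    Assumes the goal differs from the starting cell.
    """
    size: int
    goal: int
    state: int = 0

    def step(self, action: int):
        # Move left (-1) or right (+1), clipped to the ends of the line.
        self.state = max(0, min(self.size - 1, self.state + action))
        reward = 1.0 if self.state == self.goal else 0.0
        return self.state, reward


def learn(memory, size, start):
    """'Learning' stand-in: goal cells still consistent with the memory buffer."""
    candidates = set(range(size)) - {start}
    for _state, _action, next_state, reward in memory:
        if reward > 0:
            return {next_state}
        candidates.discard(next_state)  # visited without reward, so not the goal
    return candidates


def plan(state, candidates, horizon):
    """'Planning' stand-in: multi-step trajectory toward the nearest candidate."""
    target = min(candidates, key=lambda g: (abs(g - state), g))
    direction = 1 if target > state else -1
    return [direction] * min(horizon, abs(target - state))


def rafa(env, horizon=3, max_steps=50):
    """Reason for future, act for now: replan at every step, act once."""
    memory = []
    start = state = env.state
    for t in range(max_steps):
        candidates = learn(memory, env.size, start)    # reason: learn
        trajectory = plan(state, candidates, horizon)  # reason: plan ahead
        action = trajectory[0]                         # act: first action only
        next_state, reward = env.step(action)
        memory.append((state, action, next_state, reward))  # store feedback
        state = next_state
        if reward > 0:
            return t + 1, memory
    return None, memory


if __name__ == "__main__":
    steps, memory = rafa(LineWorld(size=6, goal=4))
    print(f"goal reached after {steps} interactions")
```

With a six-cell line, a start at cell 0, and the goal at cell 4, this sketch reaches the goal after 4 interactions. The design point it illustrates is that the plan spans several future steps while only its first action is executed, so every new piece of feedback can revise the remaining plan.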

Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency
