Abstract: Q-learning excels in learning from feedback within sequential decision-making tasks but often requires extensive sampling to achieve significant improvements. While reward shaping can enhance learning efficiency, non-potential-based methods introduce biases that affect performance, and potential-based reward shaping, though unbiased, lacks the ability to provide heuristics for state-action pairs, limiting its effectiveness in complex environments. Large language models (LLMs) can achieve zero-shot learning for simpler tasks, but they suffer from low inference speeds and occasional hallucinations. To address these challenges, we propose LLM-guided Q-learning, a framework that leverages LLMs as heuristics to aid in learning the Q-function for reinforcement learning. Our theoretical analysis demonstrates that this approach adapts to hallucinations, improves sample efficiency, and avoids biasing final performance. Experimental results show that our algorithm is general, robust, and capable of preventing ineffective exploration.

What problem does this paper attempt to address?

This paper addresses the low sample efficiency of Q-learning in sequential decision-making tasks. Although Q-learning learns well from feedback, it usually needs a large number of samples before performance improves significantly. Existing reward shaping methods can improve learning efficiency, but non-potential-based methods introduce biases that affect performance, while potential-based reward shaping, though unbiased, has limited ability to provide heuristics for state-action pairs in complex environments. Large language models (LLMs) can achieve zero-shot learning on simple tasks, but they suffer from slow inference and occasional hallucinations.

To address these challenges, the paper proposes LLM-guided Q-learning, a framework that uses LLMs as heuristics to assist in learning the Q-function. The method aims to adapt to hallucinations, improve sample efficiency, and avoid biasing final performance. Experimental results show that the algorithm is general, robust, and able to prevent ineffective exploration.

The main contributions of the paper include:

1. Combining the advantages of reward-shaping techniques and the LLM/VLM agent framework to improve sample efficiency.
2. Transforming the impact of inaccurate or hallucinated guidance into a cost of exploration.
3. Supporting online correction and interaction with human feedback.

Through these improvements, the proposed framework addresses the deficiencies of existing methods and offers a new way to use generative models to accelerate reinforcement learning training.
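To make the idea of heuristic guidance concrete, here is a minimal sketch in Python. It is not the paper's algorithm: it assumes a toy 6-state chain environment, a hypothetical `mock_llm_heuristic` function standing in for an actual LLM query, and the simplest possible way of using the hints, namely as initial Q-values. One of the hints is deliberately wrong, to illustrate how ordinary temporal-difference updates might turn a hallucinated hint into extra exploration rather than a permanently biased policy.

```python
import random

N_STATES = 6        # chain of states 0..5; state 5 is the terminal goal
ACTIONS = (0, 1)    # 0 = move left, 1 = move right
GAMMA = 0.9         # discount factor


def step(state, action):
    """Deterministic chain: reward 1.0 only when the goal is reached."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def mock_llm_heuristic(state, action):
    """Hypothetical stand-in for an LLM query (an assumption, not a real API).

    Returns a guessed Q-value. It prefers 'right' everywhere except in
    state 2, where it 'hallucinates' and prefers 'left'.
    """
    if state == 2:
        return 0.5 if action == 0 else 0.0   # misleading hint
    return 0.5 if action == 1 else 0.0


def train(heuristic, episodes=300, alpha=0.5, epsilon=0.1, seed=0):
    """Tabular Q-learning whose table is initialised from the heuristic."""
    rng = random.Random(seed)
    q = {(s, a): heuristic(s, a) for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done, steps = 0, False, 0
        while not done and steps < 100:       # cap the episode length
            if rng.random() < epsilon:        # epsilon-greedy exploration
                action = rng.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])
            nxt, reward, done = step(state, action)
            # Standard Q-learning target; no bootstrapping from terminal states.
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            target = reward + GAMMA * best_next
            q[(state, action)] += alpha * (target - q[(state, action)])
            state, steps = nxt, steps + 1
    return q


def greedy_policy(q):
    """Greedy action for every non-terminal state."""
    return [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]


if __name__ == "__main__":
    print(greedy_policy(train(mock_llm_heuristic)))
```

In this sketch the misleading hint at state 2 initially sends the agent back and forth between states 1 and 2. Because the hinted values are only a starting point, the updates shrink them until exploration discovers the rewarding path, after which the greedy policy should prefer moving right in every state. The environment, hyperparameters, and initialization scheme are all illustrative choices.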

Enhancing Q-Learning with Large Language Model Heuristics
