Enhancing Q-Learning with Large Language Model Heuristics

Xiefeng Wu
2024-05-24
Abstract: Q-learning excels in learning from feedback within sequential decision-making tasks but often requires extensive sampling to achieve significant improvements. While reward shaping can enhance learning efficiency, non-potential-based methods introduce biases that affect performance, and potential-based reward shaping, though unbiased, lacks the ability to provide heuristics for state-action pairs, limiting its effectiveness in complex environments. Large language models (LLMs) can achieve zero-shot learning for simpler tasks, but they suffer from low inference speeds and occasional hallucinations. To address these challenges, we propose LLM-guided Q-learning, a framework that leverages LLMs as heuristics to aid in learning the Q-function for reinforcement learning. Our theoretical analysis demonstrates that this approach adapts to hallucinations, improves sample efficiency, and avoids biasing final performance. Experimental results show that our algorithm is general, robust, and capable of preventing ineffective exploration.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the low sample efficiency of Q-learning in sequential decision-making tasks. Although Q-learning learns well from feedback, it usually needs a large number of samples before performance improves significantly. Existing reward shaping methods can improve learning efficiency, but non-potential-based methods introduce biases that affect performance, while potential-based reward shaping, though unbiased, cannot provide heuristics for state-action pairs, which limits it in complex environments. Large language models (LLMs) can achieve zero-shot learning on simpler tasks, but they suffer from slow inference and occasional hallucinations.

To address these challenges, the paper proposes LLM-guided Q-learning, a framework that uses LLMs as heuristics to assist in learning the Q-function. The method aims to adapt to hallucinations, improve sample efficiency, and avoid biasing final performance. Experimental results show that the algorithm is general, robust, and able to prevent ineffective exploration.

The main contributions of the paper are:

1. Combining the advantages of reward shaping techniques and the LLM/VLM agent framework to improve sample efficiency.
2. Converting the impact of inaccurate or hallucinated guidance into a cost of exploration.
3. Supporting online correction and interaction with human feedback.

Through these improvements, the framework addresses the deficiencies of existing methods and provides a new way to use generative models to accelerate reinforcement learning training.
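
The idea that heuristic guidance can act on the Q-function, and that wrong guidance costs extra exploration rather than final performance, can be pictured with a small tabular sketch. This is not the paper's algorithm: the 5-state chain environment, the `helpful` and `misleading` tables (hypothetical stand-ins for LLM-suggested state-action values), and all hyperparameters are assumptions made for illustration, and the guidance is applied in the simplest possible way, as the starting Q-table that ordinary Q-learning updates then correct.

```python
import random


def q_learning(heuristic, episodes=2000, alpha=0.5, gamma=0.9,
               epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning on a toy chain, with Q seeded from a heuristic table.

    heuristic[s] = [value_of_left, value_of_right] is a hypothetical stand-in
    for LLM-suggested state-action values; no LLM is called here.
    The agent starts in state 0; reaching the last state gives reward 1.
    """
    rng = random.Random(seed)
    n_states = len(heuristic)
    goal = n_states - 1
    q = [list(row) for row in heuristic]  # guidance enters only as the starting Q

    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:
                a = rng.randrange(2)  # explore: random action
            else:
                best = max(q[s])      # exploit: greedy, ties broken at random
                a = rng.choice([i for i in (0, 1) if q[s][i] == best])
            s_next = max(0, s - 1) if a == 0 else s + 1
            done = s_next == goal
            reward = 1.0 if done else 0.0
            # Standard Q-learning target; no bootstrapping from the goal state.
            target = reward if done else reward + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
            if done:
                break
    return q


def greedy_policy(q):
    """Greedy action (0 = left, 1 = right) for every non-goal state."""
    return [0 if row[0] > row[1] else 1 for row in q[:-1]]


N = 5
helpful = [[0.0, 0.5] for _ in range(N)]     # hints "go right" (correct here)
misleading = [[0.5, 0.0] for _ in range(N)]  # hints "go left" (a "hallucination")

policy_helpful = greedy_policy(q_learning(helpful))
policy_misleading = greedy_policy(q_learning(misleading))
```

In this toy chain both runs should end with a greedy policy that moves right in every state. The misleading table only delays that outcome, because the agent first has to try the suggested "left" moves and watch their values decay before the reward at the goal propagates back.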