Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning

Guanwen Xie, Jingzehua Xu, Yiyuan Yang, Yimian Ding, Shuai Zhang
2024-11-01
Abstract: Achieving the effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to be effective white-box searchers and highlights their advanced semantic understanding capabilities. Specifically, we generate reward components for each numerically explicit user requirement and employ a reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively adjust the weights without ambiguity or redundant adjustments by flexibly adopting directional mutation and crossover strategies, similar to genetic algorithms, based on the context provided by the training log analyzer. We applied the framework to an underwater data collection RL task without direct human feedback or reward examples (zero-shot learning). The reward critic successfully corrects the reward code with only one feedback instance for each requirement, effectively preventing unrectifiable errors. The initialization of weights enables the acquisition of different reward functions within the Pareto solution set without the need for weight search. Even in cases where a weight is 500 times off, on average, only 5.2 iterations are needed to meet user requirements. ERFSL also works well with most prompts utilizing GPT-4o mini, as we decompose the weight searching process to reduce the requirement for numerical and long-context understanding capabilities.
Machine Learning, Artificial Intelligence, Computation and Language, Systems and Control
What problem does this paper attempt to address?
This paper addresses the difficulty of designing and optimizing reward functions for multi-objective reinforcement learning (RL) tasks in complex custom environments. Specifically:

1. **Reward function design in complex environments**:
   - In multi-objective RL tasks, the reward function grows more complex as requirements and optimization goals are added, and tuning the structure and coefficients of each reward component takes considerable effort.
   - Researchers' requirements may change across scenarios and over time, and are sometimes ambiguous, which makes optimal performance even harder to achieve.
2. **Lack of a clear feedback mechanism**:
   - For complex reward functions, problems such as code errors and weight imbalances are hard to resolve from training feedback alone. Traditional trial-and-error methods are inefficient with multiple objectives and struggle to find the optimal solution.
3. **Efficient search for the weights of the reward function**:
   - Multi-objective RL tasks require not only the correct form of each reward component but also reasonable scaling of the component weights. Searching and adjusting these weights quickly and effectively without direct human feedback is an important open issue.

To solve these problems, the authors propose ERFSL (Efficient Reward Function Searcher using LLMs), which uses large language models (LLMs) as efficient reward function searchers. ERFSL does so in the following ways:

- **Task decomposition**: Decompose user requirements into clear numerical goals, eliminating the ambiguity in training feedback.
- **Use of a reward critic**: Automatically correct the reward component code corresponding to each user requirement.
- **White-box search**: Use the semantic understanding ability of LLMs to conduct efficient weight search in a clear task context.
- **Zero-shot learning**: Meet user requirements within a small number of iterations without providing any reward examples.

Through these methods, ERFSL can quickly generate and optimize the reward function of multi-objective RL tasks in complex environments, significantly reducing the number of required iterations and the search time.
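
To make the weight-search idea concrete, here is a minimal Python sketch of a reward built as a weighted sum of per-requirement components, with a directional mutation step and a crossover step over the weights. It is not the authors' implementation: in ERFSL the LLM proposes the adjustments from the training log analysis, whereas this sketch substitutes fixed rules. The function names, the component names (`energy`, `collision`, `data_overflow`), the numeric values, and the mutation factor are all hypothetical.

```python
from typing import Dict, Set

Weights = Dict[str, float]


def weighted_reward(components: Dict[str, float], weights: Weights) -> float:
    """Scalar reward as a weighted sum of per-requirement components."""
    return sum(weights[name] * value for name, value in components.items())


def directional_mutation(
    weights: Weights, unmet: Dict[str, bool], factor: float = 2.0
) -> Weights:
    """Scale up the weight of each unmet requirement; leave the rest unchanged.

    Assumption: a fixed multiplicative factor stands in for the adjustment
    that an LLM would propose in the framework described above.
    """
    return {
        name: w * factor if unmet.get(name, False) else w
        for name, w in weights.items()
    }


def crossover(a: Weights, b: Weights, take_from_b: Set[str]) -> Weights:
    """Build a child that takes the listed weights from b and the rest from a."""
    return {name: (b[name] if name in take_from_b else a[name]) for name in a}


# Hypothetical component names and values, for illustration only.
weights = {"energy": 1.0, "collision": 1.0, "data_overflow": 0.5}
components = {"energy": -0.25, "collision": -1.0, "data_overflow": -4.0}

reward = weighted_reward(components, weights)

# Suppose the training log shows that only the collision requirement is unmet.
mutated = directional_mutation(weights, {"collision": True})

# Combine two candidates: keep the parent's weights except for "collision".
child = crossover(weights, mutated, {"collision"})
```

Because each weight maps to one numerically explicit requirement, an unmet requirement indicates which weight to change and in which direction, which is what distinguishes this kind of search from undirected trial and error.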