Abstract: Designing and improving reward functions effectively in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to be effective white-box searchers and highlights their advanced semantic understanding capabilities. Specifically, we generate reward components for each numerically explicit user requirement and employ a reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively adjust the weights, avoiding ambiguous and redundant adjustments, by flexibly adopting directional mutation and crossover strategies, similar to genetic algorithms, based on the context provided by the training log analyzer. We applied the framework to an underwater data collection RL task without direct human feedback or reward examples (zero-shot learning). The reward critic successfully corrects the reward code with only one feedback instance for each requirement, effectively preventing unrectifiable errors. The initialization of weights enables the acquisition of different reward functions within the Pareto solution set without the need for weight search. Even in cases where a weight is 500 times off, only 5.2 iterations are needed on average to meet user requirements. ERFSL also works well with most prompts using GPT-4o mini, as we decompose the weight searching process to reduce the requirement for numerical and long-context understanding capabilities.
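The weight adjustment described in the abstract can be illustrated with a minimal sketch. In ERFSL the LLM chooses which weights to change and in which direction; the sketch below only shows what such directional mutation and crossover operations might look like on a weight dictionary. The function names, the component names (`safety`, `energy`), and the scaling factor of 10 are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, Iterable

Weights = Dict[str, float]


def directional_mutation(
    weights: Weights, directions: Dict[str, int], factor: float = 10.0
) -> Weights:
    """Scale each weight up (+1), down (-1), or leave it unchanged (0 or absent).

    `directions` is a hypothetical stand-in for the decision the LLM would make
    after reading the training log analysis.
    """
    if factor <= 1.0:
        raise ValueError("factor must be greater than 1")
    return {
        name: w * factor ** directions.get(name, 0)
        for name, w in weights.items()
    }


def crossover(a: Weights, b: Weights, take_from_b: Iterable[str]) -> Weights:
    """Build a new candidate from `a`, copying the listed weights from `b`."""
    chosen = set(take_from_b)
    return {name: (b[name] if name in chosen else a[name]) for name in a}


if __name__ == "__main__":
    # Hypothetical starting point: the "energy" weight is far too large.
    start = {"safety": 1.0, "energy": 500.0}
    mutated = directional_mutation(start, {"energy": -1})
    combined = crossover(start, mutated, ["energy"])
    print(mutated, combined)
```

Mutating only the weights tied to unmet requirements, and only in one stated direction, is one way to read the abstract's "without ambiguity and redundant adjustments"; the paper's actual prompts and update rules may differ.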

What problem does this paper attempt to address?

This paper addresses the challenges of designing and optimizing reward functions for multi-objective reinforcement learning (RL) tasks in complex custom environments. Specifically:

1. **Reward function design in complex environments**:
   - In multi-objective RL tasks, as the requirements and optimization goals increase, the design of the reward function becomes very complex, requiring a great deal of effort to adjust the structure and coefficients of each reward component.
   - Researchers' requirements may change across scenarios and over time, and are sometimes ambiguous, which further increases the difficulty of achieving optimal performance.
2. **Lack of a clear feedback mechanism**:
   - For complex reward functions, problems such as code errors and weight imbalances are difficult to resolve through training feedback alone. Traditional trial-and-error methods are inefficient with multiple objectives and struggle to find the optimal solution.
3. **Efficient search for the weights of the reward function**:
   - Multi-objective RL tasks require not only the correct form of the reward components but also reasonable scaling of their weights. Searching and adjusting these weights quickly and effectively without direct human feedback is an important issue.

To solve these problems, the authors propose ERFSL (Efficient Reward Function Searcher using LLMs), which uses large language models (LLMs) as efficient reward function searchers. ERFSL does so in the following ways:

- **Task decomposition**: Decompose user requirements into clear numerical goals, eliminating ambiguity in training feedback.
- **Use of a reward critic**: Automatically correct the reward component code corresponding to each user requirement.
- **White-box search**: Use the semantic understanding ability of LLMs to search weights efficiently in a clear task context.
- **Zero-shot learning**: Meet user requirements within a small number of iterations without providing any reward examples.

Through these methods, ERFSL can quickly generate and optimize the reward function of multi-objective RL tasks in complex environments, significantly reducing the number of required iterations and search time.
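The task decomposition above can be made concrete with a small sketch: each user requirement becomes a numeric threshold on a logged metric, the total reward is a weighted sum of components, and a log check reports which requirements remain unmet. The class and function names, the metric names, and the threshold values here are all hypothetical; the paper's actual requirements and log format are not reproduced.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Requirement:
    """A numerically explicit requirement on one logged metric (hypothetical)."""

    metric: str
    compare: Callable[[float, float], bool]  # e.g. operator.le or operator.ge
    threshold: float

    def satisfied(self, log: Dict[str, float]) -> bool:
        return self.compare(log[self.metric], self.threshold)


def total_reward(components: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-requirement reward components."""
    return sum(weights[name] * value for name, value in components.items())


def unmet(requirements: List[Requirement], log: Dict[str, float]) -> List[str]:
    """Return the metrics whose requirement is not satisfied by the training log."""
    return [r.metric for r in requirements if not r.satisfied(log)]


if __name__ == "__main__":
    # Hypothetical requirements and an equally hypothetical training log.
    reqs = [
        Requirement("collisions", operator.le, 0.0),
        Requirement("energy", operator.le, 100.0),
    ]
    log = {"collisions": 2.0, "energy": 80.0}
    print(unmet(reqs, log))
```

Because every requirement is tied to one metric and one threshold, the list returned by `unmet` tells the searcher exactly which reward component to inspect, which is the kind of unambiguous feedback the summary attributes to task decomposition.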

Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning
