Abstract: A growing body of work has emerged focusing on instruction-following policies for open-world agents, aiming to better align the agent's behavior with human intentions. However, the performance of these policies is highly sensitive to the initial prompt, which leads to extra effort in selecting the best instructions. We propose a framework named Preference Goal Tuning (PGT). PGT allows an instruction-following policy to interact with the environment to collect several trajectories, which are categorized into positive and negative samples based on preference. We then use preference learning to fine-tune the initial goal latent representation with the categorized trajectories while keeping the policy backbone frozen. Experimental results show that, with minimal data and training, PGT achieves an average relative improvement of 72.0% and 81.6% over 17 tasks on 2 different foundation policies respectively, and outperforms the best human-selected instructions. Moreover, PGT surpasses full fine-tuning in out-of-distribution (OOD) task-execution environments by 13.4%, indicating that our approach retains strong generalization capabilities. Since our approach stores a single latent representation for each task independently, it can be viewed as an efficient method for continual learning, without the risk of catastrophic forgetting or task interference. In short, PGT enhances the performance of agents across nearly all tasks in the Minecraft Skillforge benchmark and demonstrates robustness to the execution environment.
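The abstract reports an "average relative improvement" without saying how the average is taken. The sketch below shows two common readings, using made-up success rates rather than any result from the paper; the task names, the numbers, and the helper `relative_improvement` are all hypothetical.

```python
def relative_improvement(before, after):
    """Relative change (after - before) / before, returned as a fraction."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before

# Hypothetical per-task success rates (NOT results from the paper).
baseline = {"task_a": 0.20, "task_b": 0.50}
tuned = {"task_a": 0.40, "task_b": 0.60}

# Reading 1: average the per-task relative improvements.
per_task = {t: relative_improvement(baseline[t], tuned[t]) for t in baseline}
mean_of_ratios = sum(per_task.values()) / len(per_task)

# Reading 2: relative improvement of the summed (equivalently, averaged) rates.
ratio_of_means = relative_improvement(sum(baseline.values()), sum(tuned.values()))

print(f"mean of per-task ratios: {mean_of_ratios:.1%}")
print(f"ratio of means:          {ratio_of_means:.1%}")
```

On these toy numbers the first reading gives about 60% and the second about 43%, so the two definitions can differ noticeably when baseline success rates are low on some tasks.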

### What problems does this paper attempt to solve?

This paper addresses the sensitivity of instruction-following policies to the initial prompt in open-world environments. Existing methods require extensive human trial and error to select the best instruction, and different prompts vary in how well they align with human intentions, which makes performance unstable. In addition, when an agent fails to complete a task, it is difficult to tell whether the cause is a limitation of the underlying policy itself or the lack of an appropriate prompt.

To solve these problems, the authors propose a framework named **Preference Goal Tuning (PGT)**. PGT optimizes the instruction-following policy in two phases:

1. **Data collection phase**: Use the initial prompt to generate multiple trajectories, and classify these trajectories into positive and negative samples according to human preferences or environmental rewards.
2. **Training phase**: Use a preference learning algorithm to fine-tune the initial goal latent representation while keeping the policy backbone frozen. This can significantly improve performance with a small amount of data and computation, without changing the model structure.

### Main contributions

- **Performance improvement**: The experimental results show that PGT achieves an average relative performance improvement of 72.0% and 81.6% on two different underlying policies (GROOT and STEVE-1), respectively, and also performs well in out-of-distribution (OOD) environments.
- **Continual learning ability**: Because PGT stores a single latent representation for each task, it serves as an efficient continual learning method that avoids catastrophic forgetting and task interference.
- **Long-horizon task handling**: By combining a high-level planner with a low-level controller, PGT shows stronger robustness and environmental generalization in long-horizon tasks.
- **New skill mining**: By fine-tuning the soft prompt, PGT can activate capabilities that were acquired during pre-training but not fully utilized, thereby completing some tasks that could not be completed originally.

### Summary

In summary, the PGT framework not only improves the performance of instruction-following policies across various tasks, but also shows potential in continual learning and complex task handling, providing new ideas for optimizing agent behavior in open-world environments.
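The two phases described above (collect trajectories with the initial goal, then tune only the goal latent while the backbone stays frozen) can be illustrated with a toy, pure-Python sketch. Everything here is an assumption made for illustration: the two-action, state-free softmax policy with weights `W`, the reward that counts occurrences of action 0, and the DPO-style loss that uses the initial goal as the reference. The source only says "preference learning", so this is not the authors' implementation.

```python
import math
import random

# Frozen toy policy backbone: one weight row per action (hypothetical values).
# Action 0 stands in for the "useful" action, action 1 for everything else.
W = [[1.0, 0.0], [0.0, 1.0]]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def action_probs(goal):
    """pi(a | goal): softmax over the logits W[a] . goal (state is ignored here)."""
    logits = [sum(w * g for w, g in zip(row, goal)) for row in W]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def traj_logp(goal, traj):
    """Log-likelihood of an action sequence under the goal-conditioned policy."""
    probs = action_probs(goal)
    return sum(math.log(probs[a]) for a in traj)

def traj_logp_grad(goal, traj):
    """Gradient of traj_logp with respect to the goal latent only."""
    probs = action_probs(goal)
    dim = len(goal)
    mean_row = [sum(p * row[i] for p, row in zip(probs, W)) for i in range(dim)]
    return [sum(W[a][i] - mean_row[i] for a in traj) for i in range(dim)]

def collect_pairs(goal, n_traj, length, rng):
    """Phase 1: roll out with the initial goal, then pair high- with low-reward runs."""
    probs = action_probs(goal)
    trajs = [rng.choices(range(len(W)), weights=probs, k=length) for _ in range(n_traj)]
    reward = lambda t: t.count(0)  # hypothetical environment reward
    trajs.sort(key=reward, reverse=True)
    half = n_traj // 2
    pairs = zip(trajs[:half], reversed(trajs[half:]))  # best with worst, and so on
    return [(pos, neg) for pos, neg in pairs if reward(pos) > reward(neg)]

def tune_goal(goal_init, pairs, beta=0.5, lr=0.05, steps=100):
    """Phase 2: gradient descent on the goal latent; W is never updated."""
    goal = list(goal_init)
    for _ in range(steps):
        grad = [0.0] * len(goal)
        for pos, neg in pairs:
            # Log-ratio margin against the initial goal, which acts as the reference.
            z = beta * ((traj_logp(goal, pos) - traj_logp(goal_init, pos))
                        - (traj_logp(goal, neg) - traj_logp(goal_init, neg)))
            coeff = -beta * (1.0 - sigmoid(z))  # d/dz of -log(sigmoid(z)), times beta
            g_pos = traj_logp_grad(goal, pos)
            g_neg = traj_logp_grad(goal, neg)
            for i in range(len(goal)):
                grad[i] += coeff * (g_pos[i] - g_neg[i]) / len(pairs)
        goal = [g - lr * d for g, d in zip(goal, grad)]
    return goal

rng = random.Random(0)
goal_init = [0.0, 0.0]
pairs = collect_pairs(goal_init, n_traj=16, length=10, rng=rng)
goal_tuned = tune_goal(goal_init, pairs)
p_before = action_probs(goal_init)[0]
p_after = action_probs(goal_tuned)[0]
print(f"P(action 0): {p_before:.3f} -> {p_after:.3f}")
```

Only the goal vector is stored per task, which is why the per-task cost is a single latent and why tuning one task cannot overwrite another; the shared weights `W` are read but never written.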

Optimizing Latent Goal by Learning from Trajectory Preference

Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction

PTR-PPO: Proximal Policy Optimization with Prioritized Trajectory Replay

Effective Tuning Strategies for Generalist Robot Manipulation Policies

MapGo: Model-Assisted Policy Optimization for Goal-Oriented Tasks

GROOT: Learning to Follow Instructions by Watching Gameplay Videos

Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process

Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations

Prompt-Tuning Decision Transformer with Preference Ranking

Pre-Training Goal-based Models for Sample-Efficient Reinforcement Learning

Preference-Guided Reinforcement Learning for Efficient Exploration

Goal-Reaching Policy Learning from Non-Expert Observations via Effective Subgoal Guidance

RL-GPT: Integrating Reinforcement Learning and Code-as-policy

Breadcrumbs to the Goal: Goal-Conditioned Exploration from Human-in-the-Loop Feedback

Trajectory-Oriented Policy Optimization with Sparse Rewards

Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs

RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches

Latent Plan Transformer for Trajectory Abstraction: Planning as Latent Space Inference

Online Guidance Graph Optimization for Lifelong Multi-Agent Path Finding

Reparameterized Policy Learning for Multimodal Trajectory Optimization

Improved Exploration through Latent Trajectory Optimization in Deep Deterministic Policy Gradient