Optimizing Latent Goal by Learning from Trajectory Preference

Guangyu Zhao, Kewei Lian, Haowei Lin, Haobo Fu, Qiang Fu, Shaofei Cai, Zihao Wang, Yitao Liang
2024-12-03
Abstract: A growing body of work has emerged focusing on instruction-following policies for open-world agents, aiming to better align the agent's behavior with human intentions. However, the performance of these policies is highly sensitive to the initial prompt, so extra effort goes into selecting the best instructions. We propose a framework named Preference Goal Tuning (PGT). PGT lets an instruction-following policy interact with the environment to collect several trajectories, which are categorized into positive and negative samples based on preference. We then use preference learning to fine-tune the initial goal latent representation on the categorized trajectories while keeping the policy backbone frozen. Experiments show that, with minimal data and training, PGT achieves average relative improvements of 72.0% and 81.6% over 17 tasks on 2 different foundation policies respectively, and outperforms the best human-selected instructions. Moreover, PGT surpasses full fine-tuning in out-of-distribution (OOD) task-execution environments by 13.4%, indicating that our approach retains strong generalization capabilities. Since our approach stores a single latent representation for each task independently, it can be viewed as an efficient method for continual learning, without the risk of catastrophic forgetting or task interference. In short, PGT enhances the performance of agents across nearly all tasks in the Minecraft Skillforge benchmark and demonstrates robustness to the execution environment.
Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the high sensitivity of instruction-following policies to the initial prompt in open-world environments. Existing methods require a great deal of human trial and error to select the best instruction, and prompts of differing quality may not align with human intentions, which makes performance unstable. In addition, when an agent fails to complete a task, it is difficult to tell whether the cause is a limitation of the underlying policy or the lack of an appropriate prompt.

To address these problems, the authors propose a framework named **Preference Goal Tuning (PGT)**. PGT optimizes an instruction-following policy in two phases:

1. **Data collection phase**: Use the initial prompt to generate multiple trajectories, and classify them into positive and negative samples according to human preferences or environmental rewards.
2. **Training phase**: Use a preference learning algorithm to fine-tune the initial goal latent representation while keeping the policy backbone frozen. This improves performance with a small amount of data and compute, without changing the model structure.

### Main contributions

- **Performance improvement**: PGT achieves average relative improvements of 72.0% and 81.6% over 17 tasks on two different underlying policies (GROOT and STEVE-1), respectively. In out-of-distribution (OOD) execution environments it surpasses full fine-tuning by 13.4%.
- **Continual learning ability**: Because PGT stores a single latent representation for each task independently, it works as an efficient continual learning method that avoids catastrophic forgetting and task interference.
- **Long-horizon task handling**: Combined with a high-level planner and a low-level controller, PGT shows stronger robustness and environmental generalization on long-horizon tasks.
- **New skill mining**: By fine-tuning the soft prompt, PGT can activate capabilities that were acquired during pre-training but not fully utilized, allowing the agent to complete some tasks it previously could not.

### Summary

The PGT framework improves the performance of instruction-following policies across a range of tasks and shows potential for continual learning and complex task handling, offering a new approach to optimizing agent behavior in open-world environments.
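
### Illustrative sketch of the two phases

The two phases described above (sorting trajectories into positive and negative samples, then tuning only the goal latent while the policy stays frozen) can be illustrated with a small toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the "policy" is a hypothetical state-independent softmax over three actions with fixed weights `W`, trajectories are plain action lists, and the loss is a DPO-style logistic preference loss against the initial latent. The source only says "preference learning", so the exact loss, the names `split_by_reward` and `tune_goal_latent`, and all hyperparameters are assumptions made for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def traj_logp_and_grad(W, g, traj):
    """Log-probability of an action sequence under the toy policy
    softmax(W @ g), and its gradient with respect to the goal latent g."""
    dim = len(g)
    logits = [sum(w * x for w, x in zip(row, g)) for row in W]
    logp = log_softmax(logits)
    probs = [math.exp(v) for v in logp]
    # Expected row of W under the current action distribution.
    expected = [sum(p * row[j] for p, row in zip(probs, W)) for j in range(dim)]
    total = sum(logp[a] for a in traj)
    grad = [sum(W[a][j] - expected[j] for a in traj) for j in range(dim)]
    return total, grad

def split_by_reward(trajs, rewards, k):
    """Phase 1 (assumed rule): top-k rewards are positives, bottom-k negatives."""
    order = sorted(range(len(trajs)), key=lambda i: rewards[i])
    negatives = [trajs[i] for i in order[:k]]
    positives = [trajs[i] for i in order[-k:]]
    return positives, negatives

def preference_loss(W, g, g_ref, positives, negatives, beta):
    """Mean DPO-style loss over all (positive, negative) pairs."""
    losses = []
    for tp in positives:
        mp = traj_logp_and_grad(W, g, tp)[0] - traj_logp_and_grad(W, g_ref, tp)[0]
        for tn in negatives:
            mn = traj_logp_and_grad(W, g, tn)[0] - traj_logp_and_grad(W, g_ref, tn)[0]
            losses.append(-math.log(sigmoid(beta * (mp - mn))))
    return sum(losses) / len(losses)

def tune_goal_latent(W, g_init, positives, negatives, beta=0.1, lr=0.1, steps=300):
    """Phase 2: gradient descent on the goal latent only; W is never modified."""
    g = list(g_init)
    ref_p = [traj_logp_and_grad(W, g_init, t)[0] for t in positives]
    ref_n = [traj_logp_and_grad(W, g_init, t)[0] for t in negatives]
    n_pairs = len(positives) * len(negatives)
    for _ in range(steps):
        pos = [traj_logp_and_grad(W, g, t) for t in positives]
        neg = [traj_logp_and_grad(W, g, t) for t in negatives]
        grad = [0.0] * len(g)
        for (lp, gp), rp in zip(pos, ref_p):
            for (ln, gn), rn in zip(neg, ref_n):
                margin = beta * ((lp - rp) - (ln - rn))
                # Gradient of -log(sigmoid(margin)) with respect to g.
                coeff = -beta * (1.0 - sigmoid(margin))
                for j in range(len(g)):
                    grad[j] += coeff * (gp[j] - gn[j]) / n_pairs
        g = [x - lr * d for x, d in zip(g, grad)]
    return g

# Toy usage: action 0 stands in for "task progress"; reward counts its occurrences.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # frozen toy policy weights
g_init = [0.0, 0.0]                          # initial goal latent
trajs = [[0, 0, 0, 1], [0, 0, 1, 0], [2, 2, 1, 2], [2, 1, 2, 2]]
rewards = [t.count(0) for t in trajs]
positives, negatives = split_by_reward(trajs, rewards, k=2)
g_tuned = tune_goal_latent(W, g_init, positives, negatives)
```

Only `g` is updated, so each task would need to store nothing more than its own tuned latent, which is the property the summary links to continual learning without task interference. In a real setting the policy would condition on observations and the gradient would come from an autodiff framework rather than the hand-derived formula used here.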