Residual-MPPI: Online Policy Customization for Continuous Control

Pengcheng Wang, Chenran Li, Catherine Weaver, Kenta Kawamoto, Masayoshi Tomizuka, Chen Tang, Wei Zhan
2024-07-11
Abstract: Policies learned through Reinforcement Learning (RL) and Imitation Learning (IL) have demonstrated significant potential in achieving advanced performance in continuous control tasks. However, in real-world environments, it is often necessary to further customize a trained policy when there are additional requirements that were unforeseen during the original training phase. It is possible to fine-tune the policy to meet the new requirements, but this often requires collecting new data with the added requirements and access to the original training metric and policy parameters. In contrast, an online planning algorithm, if capable of meeting the additional requirements, can eliminate the necessity for extensive training phases and customize the policy without knowledge of the original training scheme or task. In this work, we propose a generic online planning algorithm for customizing continuous-control policies at the execution time which we call Residual-MPPI. It is able to customize a given prior policy on new performance metrics in few-shot and even zero-shot online settings. Also, Residual-MPPI only requires access to the action distribution produced by the prior policy, without additional knowledge regarding the original task. Through our experiments, we demonstrate that the proposed Residual-MPPI algorithm can accomplish the few-shot/zero-shot online policy customization task effectively, including customizing the champion-level racing agent, Gran Turismo Sophy (GT Sophy) 1.0, in the challenging car racing scenario, Gran Turismo Sport (GTS) environment. Demo videos are available on our website: https://sites.google.com/view/residual-mppi
Robotics
What problem does this paper attempt to address?
The paper addresses how to customize an already trained policy online in continuous control tasks. Specifically, it targets requirements that arise in real-world deployment and were not anticipated during the original training phase, and it aims to satisfy them without retraining the policy. The usual alternative is fine-tuning, but this typically requires collecting new data under the added requirements, as well as access to the original training metric and the policy parameters.
The proposed method, Residual-MPPI (Residual Model Predictive Path Integral), is a generic online planning algorithm that customizes a continuous-control policy at execution time without knowledge of the original training scheme or task. It needs only the action distribution produced by the prior policy. The core idea is to combine a learned policy with planning: the Model Predictive Path Integral (MPPI) framework is used to adjust the prior policy's behavior online toward the new performance metric. This allows customization in few-shot and even zero-shot settings, that is, with few or no additional samples collected for the new requirement.
Experiments show that Residual-MPPI accomplishes online policy customization in standard benchmark environments (such as MuJoCo) and in the more complex racing simulator Gran Turismo Sport. In particular, it is used to customize the champion-level racing agent GT Sophy 1.0 toward safer route selection. In summary, the paper addresses the challenge of efficiently adapting existing policies to newly emerging requirements, especially when new training data is hard to obtain, and it demonstrates that flexible and rapid policy customization is feasible across a range of continuous control tasks.
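To make the planning idea concrete, the following is a minimal sketch of an MPPI-style customization step, not the authors' implementation. Everything in it is an assumption chosen for illustration: the 1-D integrator dynamics, the Gaussian prior policy that drives the state toward the origin, the add-on requirement "keep the state at or above 1.0", and all parameter values (`omega`, `temperature`, `noise_std`). Scoring rollouts by the add-on reward plus a weighted log-likelihood under the prior is one plausible way to use only the prior's action distribution; the summary above does not spell out the exact objective. The sketch also simplifies MPPI by returning only a weighted first action instead of updating a full nominal control sequence.

```python
import math
import random


def prior_mean(x):
    """Hypothetical prior policy mean: drive the state toward the origin."""
    return -0.5 * x


def prior_log_prob(x, a, sigma=0.3):
    """Log-density of an assumed Gaussian prior policy N(prior_mean(x), sigma^2)."""
    z = (a - prior_mean(x)) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))


def dynamics(x, a):
    """Assumed known toy dynamics: a 1-D integrator."""
    return x + a


def addon_reward(x_next):
    """Hypothetical add-on requirement: penalize states below 1.0."""
    return -max(0.0, 1.0 - x_next)


def customized_action(x0, horizon=5, n_samples=1024, noise_std=0.3,
                      temperature=1.0, omega=0.1, seed=0):
    """One planning step: sample rollouts around the prior, reweight, average."""
    rng = random.Random(seed)
    first_actions, returns = [], []
    for _ in range(n_samples):
        x, total = x0, 0.0
        for t in range(horizon):
            # Sample an action by perturbing the prior policy's mean action.
            a = prior_mean(x) + rng.gauss(0.0, noise_std)
            if t == 0:
                first_actions.append(a)
            x_next = dynamics(x, a)
            # Score: add-on reward plus a term that keeps actions likely
            # under the prior policy (assumed form of the objective).
            total += addon_reward(x_next) + omega * prior_log_prob(x, a)
            x = x_next
        returns.append(total)
    # Path-integral style weights: softmax of rollout returns.
    best = max(returns)
    weights = [math.exp((r - best) / temperature) for r in returns]
    norm = sum(weights)
    return sum(w * a for w, a in zip(weights, first_actions)) / norm


if __name__ == "__main__":
    x0 = 2.0
    print("prior action:     ", prior_mean(x0))
    print("customized action:", customized_action(x0))
```

In this toy setting the prior alone would move the state from 2.0 straight to the boundary of the penalized region, while the reweighted action is less aggressive because rollouts that stay above 1.0 receive higher weight. Setting `noise_std=0.0` removes the exploration and recovers the prior action, which is a quick sanity check that the planner only deviates from the prior through the sampled perturbations.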