Abstract: Policies learned through Reinforcement Learning (RL) and Imitation Learning (IL) have demonstrated significant potential in achieving advanced performance in continuous control tasks. However, in real-world environments, it is often necessary to further customize a trained policy when there are additional requirements that were unforeseen during the original training phase. It is possible to fine-tune the policy to meet the new requirements, but this often requires collecting new data with the added requirements and access to the original training metric and policy parameters. In contrast, an online planning algorithm, if capable of meeting the additional requirements, can eliminate the necessity for extensive training phases and customize the policy without knowledge of the original training scheme or task. In this work, we propose a generic online planning algorithm for customizing continuous-control policies at the execution time which we call Residual-MPPI. It is able to customize a given prior policy on new performance metrics in few-shot and even zero-shot online settings. Also, Residual-MPPI only requires access to the action distribution produced by the prior policy, without additional knowledge regarding the original task. Through our experiments, we demonstrate that the proposed Residual-MPPI algorithm can accomplish the few-shot/zero-shot online policy customization task effectively, including customizing the champion-level racing agent, Gran Turismo Sophy (GT Sophy) 1.0, in the challenging car racing scenario, Gran Turismo Sport (GTS) environment. Demo videos are available on our website: https://sites.google.com/view/residual-mppi

What problem does this paper attempt to address?

The paper addresses how to customize an already trained policy online in continuous control tasks. Specifically, it asks how to meet new requirements that arise in real-world use and were not anticipated during the original training phase, without retraining the entire policy. The usual answer is to fine-tune the policy, but this often requires collecting new data that reflects the additional requirements, as well as access to the original training metric and the policy parameters.

The proposed method, Residual-MPPI (Residual Model Predictive Path Integral), is instead a general online planning algorithm that customizes continuous-control policies at execution time without knowledge of the original training scheme or task. Its core idea is to combine the advantages of Reinforcement Learning (RL) and planning methods, using the Model Predictive Path Integral (MPPI) framework to customize the prior policy online. It needs only the action distribution produced by the prior policy, and it works in few-shot and even zero-shot settings.

Experimental results show that Residual-MPPI accomplishes online policy customization in multiple standard benchmark environments (such as MuJoCo) and in a more complex racing simulation (Gran Turismo Sport), including customizing safe route selection for the champion-level racing policy GT Sophy 1.0. In summary, the paper targets the challenge of efficiently adjusting existing policies to newly emerging requirements, especially when new training data is not easily obtainable, and demonstrates that flexible and rapid policy customization is possible across various continuous control tasks.
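
To make the idea of planning around a prior policy more concrete, here is a minimal MPPI-style sketch in Python. It is not the authors' implementation. It assumes a toy 1-D single-integrator system with known dynamics, a Gaussian prior policy, and an objective made of an add-on reward plus a weighted log-likelihood of each action under the prior; the summary above does not spell out the objective, so that form is an assumption. All names (`prior_mean`, `dynamics`, `energy_penalty`, `plan_action`) and all parameter values are hypothetical.

```python
import numpy as np

def prior_mean(states):
    """Hypothetical prior policy: mean action steering a 1-D point toward x = 1."""
    return np.clip(1.0 - states, -1.0, 1.0)

def dynamics(states, actions, dt=0.1):
    """Hypothetical known dynamics (single integrator), assumed for illustration."""
    return states + dt * actions

def energy_penalty(states, actions):
    """Hypothetical add-on requirement: prefer smaller actions."""
    return -actions ** 2

def plan_action(state, addon_reward, rng, horizon=5, n_samples=4096,
                sigma=0.3, temperature=1.0, omega=0.1):
    """One MPPI-style planning step that samples around the prior policy.

    Assumed objective per step: addon_reward + omega * log pi_prior(a | s).
    """
    states = np.full(n_samples, float(state))
    returns = np.zeros(n_samples)
    first_actions = None
    for t in range(horizon):
        # Sample actions from the prior's Gaussian action distribution.
        noise = rng.normal(0.0, sigma, size=n_samples)
        actions = prior_mean(states) + noise
        # Gaussian log-likelihood under the prior, up to an additive constant.
        log_prior = -0.5 * (noise / sigma) ** 2
        returns += addon_reward(states, actions) + omega * log_prior
        if t == 0:
            first_actions = actions
        states = dynamics(states, actions)
    # Exponential weighting of rollouts; subtract the max for numerical stability.
    weights = np.exp((returns - returns.max()) / temperature)
    weights /= weights.sum()
    # Execute only the weighted first action, then replan at the next step.
    return float(weights @ first_actions)

if __name__ == "__main__":
    no_addon = lambda s, a: np.zeros_like(a)
    print("prior-like action:", plan_action(0.0, no_addon, np.random.default_rng(0)))
    print("customized action:", plan_action(0.0, energy_penalty, np.random.default_rng(0)))
```

In this toy setup the planner never touches the prior's parameters or its original training reward; it only samples from the prior's action distribution and reweights rollouts by the add-on objective, which is the property the abstract emphasizes. With no add-on reward the planned action stays close to the prior's mean, while the penalty pulls it toward smaller magnitudes.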

Residual-MPPI: Online Policy Customization for Continuous Control

Residual Q-Learning: Offline and Online Policy Customization without Value

RL-Driven MPPI: Accelerating Online Control Laws Calculation with Offline Policy

Residual Policy Learning Facilitates Efficient Model-Free Autonomous Racing

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Unified Policy Optimization for Continuous-action Reinforcement Learning in Non-stationary Tasks and Games

Residual Policy Learning for Powertrain Control

Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

RaceMOP: Mapless Online Path Planning for Multi-Agent Autonomous Racing using Residual Policy Learning

Mildly Constrained Evaluation Policy for Offline Reinforcement Learning

Making Better Decision by Directly Planning in Continuous Control

PROTO: Iterative Policy Regularized Offline-to-Online Reinforcement Learning

Reparameterized Policy Learning for Multimodal Trajectory Optimization

Policy ensemble gradient for continuous control problems in deep reinforcement learning

Efficient Deep Learning of Robust, Adaptive Policies using Tube MPC-Guided Data Augmentation

OMPO: A Unified Framework for RL under Policy and Dynamics Shifts

Training Efficient Controllers via Analytic Policy Gradient

Online Reinforcement Learning for Real-Time Exploration in Continuous State and Action Markov Decision Processes

Residual Policy Learning for Perceptive Quadruped Control Using Differentiable Simulation

Theoretically Guaranteed Policy Improvement Distilled from Model-Based Planning

Efficient and Stable Offline-to-online Reinforcement Learning Via Continual Policy Revitalization