Model-Based Reinforcement Learning via Latent-Space Collocation

Oleh Rybkin, Chuning Zhu, Anusha Nagabandi, Kostas Daniilidis, Igor Mordatch, Sergey Levine
DOI: https://doi.org/10.48550/arXiv.2106.13229
2021-08-08
Abstract: The ability to plan into the future while utilizing only raw high-dimensional observations, such as images, can provide autonomous agents with broad capabilities. Visual model-based reinforcement learning (RL) methods that plan future actions directly have shown impressive results on tasks that require only short-horizon reasoning, however, these methods struggle on temporally extended tasks. We argue that it is easier to solve long-horizon tasks by planning sequences of states rather than just actions, as the effects of actions greatly compound over time and are harder to optimize. To achieve this, we draw on the idea of collocation, which has shown good results on long-horizon tasks in optimal control literature, and adapt it to the image-based setting by utilizing learned latent state space models. The resulting latent collocation method (LatCo) optimizes trajectories of latent states, which improves over previously proposed shooting methods for visual model-based RL on tasks with sparse rewards and long-term goals. Videos and code at https://orybkin.github.io/latco/.
Machine Learning, Artificial Intelligence, Robotics
### What problem does this paper attempt to solve?

This paper addresses the poor performance of reinforcement learning (RL) algorithms on long-horizon tasks when planning directly from raw high-dimensional observations such as images. Existing visual model-based RL methods perform well on short-horizon tasks but struggle on tasks that require long-term planning, because the effects of actions compound over time and make the optimization difficult.

#### Main problems:

1. **Challenges of long-horizon tasks**: Existing methods perform poorly on tasks that require long-term planning, especially tasks with sparse rewards and long-term goals.
2. **Limitations of directly optimizing actions**: Traditional shooting methods plan future behavior by directly optimizing action sequences, but on long-horizon tasks they are prone to getting stuck in local optima and have difficulty finding the global optimum.
3. **Balance between dynamic feasibility and optimization**: When optimizing a sequence of states, the states must remain dynamically feasible, that is, each state must be reachable from the previous state through a reasonable action.

#### Solutions:

To solve these problems, the authors introduce a new method, **Latent-Space Collocation (LatCo)**. The core idea is to perform long-term planning by optimizing a sequence of latent states instead of directly optimizing the action sequence. The specific steps are as follows:

- **Latent space modeling**: Use convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to learn a compact latent state space model that represents high-dimensional image observations and has the Markov property.
- **Collocation method optimization**: Draw on the collocation method from optimal control to optimize the state sequence in the latent space, maximizing reward while ensuring dynamic feasibility.
- **Constrained optimization**: Introduce Lagrange multipliers and distribution-matching constraints so that the optimized state sequence conforms to the dynamics model and can be optimized efficiently.

Through these methods, LatCo can plan effectively over long horizons in complex visual environments and overcome the limitations of traditional methods on long-horizon tasks.

### Formula summary:

- Dynamic model constraint:
  \[ z_{t+1} = \mu(z_t, a_t) \]
- Objective function:
  \[ \max_{z_{2:T},\, a_{1:T-1}} \sum_{t} r(z_t) \quad \text{s.t.} \quad z_{t+1} = \mu(z_t, a_t) \]
- Lagrangian:
  \[ L(z_{t+1:t+H}, a_{t:t+H}, \lambda) = \sum_{t} \left[ r(z_t) - \lambda_{\mathrm{dyn},t} \left( \| z_{t+1} - \mu(z_t, a_t) \|^2 - \epsilon_{\mathrm{dyn}} \right) - \lambda_{\mathrm{act},t} \left( \max(0, |a_t| - a_m)^2 - \epsilon_{\mathrm{act}} \right) \right] \]

Here \(\lambda_{\mathrm{dyn},t}\) and \(\lambda_{\mathrm{act},t}\) are the Lagrange multipliers for the dynamics and action constraints, \(\epsilon_{\mathrm{dyn}}\) and \(\epsilon_{\mathrm{act}}\) are the corresponding tolerances, and \(a_m\) is the bound on the action magnitude.

In this way, LatCo improves over shooting-based visual model-based RL methods on tasks with sparse rewards and long-term goals.
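
To make the Lagrangian above concrete, here is a minimal sketch on a toy one-dimensional problem. It is not the authors' implementation. The linear dynamics `mu(z, a) = z + a`, the goal reward that is nonzero only at the final step, the plain gradient ascent on states and actions with gradient-based multiplier updates, the clipping of multipliers to `[0, lam_max]`, all step sizes, and the names `plan`, `lagrangian`, `GOAL`, and `A_MAX` are assumptions chosen for illustration. A real latent model would be a learned network and the gradients would come from automatic differentiation.

```python
# Toy collocation with Lagrange multipliers (illustrative assumptions only).
GOAL = 1.0    # hypothetical goal state
A_MAX = 0.5   # plays the role of a_m, the action magnitude bound


def mu(z, a):
    """Assumed toy dynamics: a 1-D integrator standing in for the latent model."""
    return z + a


def reward(z, is_final):
    """Assumed sparse reward: only the final planned state is scored."""
    return -(z - GOAL) ** 2 if is_final else 0.0


def lagrangian(z, a, lam_dyn, lam_act, eps_dyn=1e-3, eps_act=1e-3):
    """Evaluate the Lagrangian; z[0] is the fixed current state."""
    n = len(a)
    total = 0.0
    for t in range(n):
        dyn = (z[t + 1] - mu(z[t], a[t])) ** 2
        act = max(0.0, abs(a[t]) - A_MAX) ** 2
        total += reward(z[t + 1], t == n - 1)
        total -= lam_dyn[t] * (dyn - eps_dyn)
        total -= lam_act[t] * (act - eps_act)
    return total


def plan(horizon=5, iters=3000, lr=0.02, lr_dual=0.1,
         eps_dyn=1e-3, eps_act=1e-3, lam_max=5.0):
    """Gradient ascent on (z, a) and gradient descent on the multipliers."""
    n = horizon - 1
    z = [0.0] * horizon          # states are optimized directly (collocation)
    a = [0.0] * n
    lam_dyn = [1.0] * n
    lam_act = [1.0] * n
    for _ in range(iters):
        gz = [0.0] * horizon
        ga = [0.0] * n
        dyn_viol = [0.0] * n
        act_viol = [0.0] * n
        for t in range(n):
            res = z[t + 1] - mu(z[t], a[t])
            excess = max(0.0, abs(a[t]) - A_MAX)
            dyn_viol[t] = res ** 2
            act_viol[t] = excess ** 2
            # Hand-derived gradients; they hard-code d mu/dz = d mu/da = 1.
            gz[t + 1] -= 2.0 * lam_dyn[t] * res
            gz[t] += 2.0 * lam_dyn[t] * res
            ga[t] += 2.0 * lam_dyn[t] * res
            sign = 1.0 if a[t] > 0 else -1.0
            ga[t] -= 2.0 * lam_act[t] * excess * sign
        gz[-1] -= 2.0 * (z[-1] - GOAL)   # gradient of the final-step reward
        for t in range(1, horizon):      # z[0] stays fixed
            z[t] += lr * gz[t]
        for t in range(n):
            a[t] += lr * ga[t]
        for t in range(n):
            # Multipliers grow while a constraint is violated beyond its tolerance.
            lam_dyn[t] = min(lam_max, max(0.0, lam_dyn[t] + lr_dual * (dyn_viol[t] - eps_dyn)))
            lam_act[t] = min(lam_max, max(0.0, lam_act[t] + lr_dual * (act_viol[t] - eps_act)))
    return z, a, lam_dyn, lam_act


if __name__ == "__main__":
    z, a, lam_dyn, lam_act = plan()
    print("states:", [round(v, 3) for v in z])
    print("actions:", [round(v, 3) for v in a])
```

Because the states are free variables, the reward gradient reaches the final state directly and the dynamics penalty then pulls earlier states and actions into agreement. This is the property that the summary contrasts with shooting, where the final state can only change through the whole action sequence.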