Learning Compact Models for Planning with Exogenous Processes

Rohan Chitnis, Tomás Lozano-Pérez
DOI: https://doi.org/10.48550/arXiv.1909.13870
2019-10-01
Abstract: We address the problem of approximate model minimization for MDPs in which the state is partitioned into endogenous and (much larger) exogenous components. An exogenous state variable is one whose dynamics are independent of the agent's actions. We formalize the mask-learning problem, in which the agent must choose a subset of exogenous state variables to reason about when planning; doing planning in such a reduced state space can often be significantly more efficient than planning in the full model. We then explore the various value functions at play within this setting, and describe conditions under which a policy for a reduced model will be optimal for the full MDP. The analysis leads us to a tractable approximate algorithm that draws upon the notion of mutual information among exogenous state variables. We validate our approach in simulated robotic manipulation domains where a robot is placed in a busy environment, in which there are many other agents also interacting with the objects. Visit http://tinyurl.com/chitnis-exogenous for a supplementary video.
Machine Learning, Artificial Intelligence, Robotics
### What problem does this paper attempt to solve?

This paper addresses efficient planning through approximate model minimization in Markov decision processes (MDPs) whose state is partitioned into endogenous and exogenous variables. Specifically, it studies how to select a subset of the exogenous state variables (a "mask") that defines a reduced MDP, so that planning becomes much cheaper with little loss in solution quality.

#### Main research questions

1. **Model simplification**: How can an MDP with a large number of exogenous state variables be simplified without significantly degrading solution quality?
2. **Mask learning problem**: How can a subset of exogenous state variables (the mask) be selected so that a policy planned over this subset performs as close to optimal as possible in the full MDP?
3. **Condition analysis**: Under what conditions is a policy obtained from the reduced model optimal for the full MDP?
4. **Algorithm design**: How can such a mask be found in a computationally tractable way?

#### Specific application scenarios

The paper validates the approach in simulated robotic manipulation domains, in which a robot operates in a busy environment where many other agents also interact with the objects. The states of these agents are modeled as exogenous state variables. By learning an appropriate mask, the robot can complete its tasks without reasoning about all of the exogenous variables.

### Main contributions of the paper

1. **Formalizing the mask learning problem**: Defines the problem of selecting a subset of a large set of exogenous state variables to plan with in the reduced model.
2. **Theoretical analysis**: Describes conditions under which a policy that is optimal for the reduced model is also optimal for the full MDP.
3. **Algorithm design**: Proposes a tractable, approximate greedy algorithm based on mutual information for selecting the mask, and demonstrates its effectiveness in different scenarios.
4. **Experimental verification**: Demonstrates the method's effectiveness in experiments on small-scale and large-scale simulated environments.

### Key formulas

- **Objective function**:
  \[ \tilde{x}^* = \arg\max_{\tilde{x} \subseteq x} J(\tilde{x}) = \arg\max_{\tilde{x} \subseteq x} \mathbb{E}\left[\sum_{t = 0}^{\infty} \gamma^t R(n_t, x_t, \tilde{\pi}(n_t, \tilde{x}_t))\right] - \lambda \cdot \text{Cost}(\tilde{x}) \]
  where $n_t$ and $x_t$ denote the endogenous and exogenous state at time $t$, $\tilde{x}_t$ is the masked part of the exogenous state, $\tilde{\pi}$ is the policy obtained by planning in the reduced model $\tilde{M}$, $\gamma$ is the discount factor, $\lambda$ is the regularization parameter, and $\text{Cost}(\tilde{x})$ is the cost of the mask.
- **Mutual information**:
  \[ D_{KL}(\hat{T}_{\tilde{s}, x_i} \,\|\, \hat{T}_{\tilde{s}} \otimes \hat{T}_{x_i}) \]
  where $\tilde{s}$ is the reduced state and $x_i$ is a candidate exogenous variable. This quantity measures how much adding $x_i$ to the mask improves the prediction of the dynamics of $\tilde{s}$.

### Summary

By introducing the mask learning problem, this paper provides a method for efficient planning in MDPs with a large number of exogenous state variables. Its theoretical analysis and experiments support the effectiveness of the method, particularly in complex robotic manipulation environments.
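The mutual-information score listed under the key formulas can be illustrated with a small numerical sketch. The code below is not the paper's estimator or its greedy algorithm. It assumes discrete variables observed as paired samples, estimates the KL divergence between the empirical joint distribution and the product of its marginals, and keeps the candidate variables whose score exceeds a threshold. The function names, the variable names `door` and `weather`, and the threshold value are all hypothetical.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical I(A; B) in nats, i.e. KL(P_AB || P_A x P_B), from (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) simplifies to count * n / (count_a * count_b)
        mi += p_ab * math.log(count * n / (marg_a[a] * marg_b[b]))
    return mi


def select_mask(target, candidates, threshold):
    """Score each candidate against the target signal; keep those above threshold.

    Simplified stand-in for a greedy mask-selection loop (an assumption, not
    the procedure from the paper).
    """
    scores = {
        name: mutual_information(list(zip(target, values)))
        for name, values in candidates.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    chosen = [name for name, score in ranked if score > threshold]
    return chosen, scores


# Hypothetical data: `target` stands in for an observed reduced-state signal.
target = [0, 1, 0, 1, 0, 1, 0, 1]
candidates = {
    "door": [0, 1, 0, 1, 0, 1, 0, 1],     # perfectly informative about target
    "weather": [0, 0, 1, 1, 0, 0, 1, 1],  # empirically independent of target
}
chosen, scores = select_mask(target, candidates, threshold=0.1)
print(chosen, scores)
```

In this toy data, `door` carries one full bit (log 2 nats) of information about the target while `weather` carries none, so only `door` is selected. In practice the threshold would trade off predictive benefit against the cost of a larger mask, in the spirit of the regularized objective.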