Abstract: We address the problem of approximate model minimization for MDPs in which the state is partitioned into endogenous and (much larger) exogenous components. An exogenous state variable is one whose dynamics are independent of the agent's actions. We formalize the mask-learning problem, in which the agent must choose a subset of exogenous state variables to reason about when planning; doing planning in such a reduced state space can often be significantly more efficient than planning in the full model. We then explore the various value functions at play within this setting, and describe conditions under which a policy for a reduced model will be optimal for the full MDP. The analysis leads us to a tractable approximate algorithm that draws upon the notion of mutual information among exogenous state variables. We validate our approach in simulated robotic manipulation domains where a robot is placed in a busy environment, in which there are many other agents also interacting with the objects. Visit http://tinyurl.com/chitnis-exogenous for a supplementary video.


### What problem does this paper attempt to solve?

This paper addresses efficient planning through approximate model minimization in Markov decision processes (MDPs) whose state is partitioned into endogenous and exogenous variables. Specifically, it studies how to select a subset of exogenous state variables (a "mask") to construct a reduced MDP that is much cheaper to plan in, while keeping the resulting policy's performance in the full MDP as close to optimal as possible.

#### Main research questions

1. **Model simplification**: How can an MDP with a large number of exogenous state variables be simplified without significantly degrading solution quality?
2. **Mask learning problem**: How can a subset of exogenous state variables (the mask) be selected so that the policy planned over this subset performs as close to optimal as possible in the full MDP?
3. **Condition analysis**: Under what conditions is a policy obtained from the reduced model optimal for the full MDP?
4. **Algorithm design**: How can such a mask be found with a computationally tractable algorithm?

#### Specific application scenarios

The paper validates the approach in simulated robotic manipulation domains, where a robot is placed in a busy environment with many other agents that also interact with the objects. The states of these agents are modeled as exogenous state variables. By learning an appropriate mask, the robot can complete its tasks without reasoning about every exogenous variable.

### Main contributions of the paper

1. **Formalizing the mask learning problem**: Defines the problem of choosing which exogenous state variables to reason about when planning in the reduced model.
2. **Theoretical analysis**: Describes conditions under which a policy for the reduced model is also optimal for the full MDP.
3. **Algorithm design**: Proposes a tractable approximate algorithm that greedily selects the mask using mutual information among exogenous state variables.
4. **Experimental verification**: Evaluates the method in small-scale and large-scale simulated environments.

### Key formulas

- **Objective function**:

  \[ \tilde{x}^* = \arg\max_{\tilde{x} \subseteq x} J(\tilde{x}) = \arg\max_{\tilde{x} \subseteq x} \left( \mathbb{E}\left[\sum_{t = 0}^{\infty} \gamma^t R(n_t, x_t, \tilde{\pi}(n_t, \tilde{x}_t))\right] - \lambda \cdot \text{Cost}(\tilde{x}) \right) \]

  where $n_t$ and $x_t$ are the endogenous and exogenous state at time $t$, $\tilde{x}_t$ is the masked part of $x_t$, $\tilde{\pi}$ is the policy obtained by planning in the reduced model $\tilde{M}$, $\lambda$ is the regularization parameter, and $\text{Cost}(\tilde{x})$ is the cost of the mask.

- **Mutual information**:

  \[ D_{KL}(\hat{T}_{\tilde{s}, x_i}\|\hat{T}_{\tilde{s}}\otimes\hat{T}_{x_i}) \]

  This is the KL divergence between the estimated joint dynamics of the reduced state $\tilde{s}$ and a candidate variable $x_i$ and the product of their separately estimated dynamics, i.e. their mutual information. It measures how much adding $x_i$ to the mask would improve prediction of the dynamics of $\tilde{s}$.

### Summary

This paper introduces the mask learning problem as a route to efficient planning in MDPs with a large number of exogenous state variables. Its theoretical analysis and simulated experiments support the effectiveness of the proposed method, particularly in busy robotic manipulation environments.
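To make the mutual-information score under "Key formulas" concrete, the following is a minimal Python sketch rather than the paper's actual algorithm. It assumes discrete variables and a small batch of logged samples, estimates mutual information as the KL divergence between the empirical joint distribution and the product of its marginals, and then adds candidate variables greedily while the information gain stays above a threshold. The names `empirical_mutual_information`, `greedy_mask`, `min_gain`, and the dictionary-based sample format are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def empirical_mutual_information(pairs):
    """Estimate I(A; B) in nats from a list of (a, b) samples.

    Computed as D_KL(P_AB || P_A x P_B) over empirical frequencies.
    """
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    marg_b = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_a = marg_a[a] / n
        p_b = marg_b[b] / n
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


def greedy_mask(samples, candidates, target, min_gain=1e-3):
    """Greedily pick variables that are most informative about `target`.

    Assumption (not from the source): `samples` is a list of dicts mapping
    variable names to discrete values, and `target` names the quantity the
    mask should help predict (for example, a next-step state feature).
    """
    mask = []
    current = 0.0
    remaining = list(candidates)

    def score(var):
        # Mutual information between the joint value of mask + var and target.
        keys = mask + [var]
        pairs = [(tuple(s[k] for k in keys), s[target]) for s in samples]
        return empirical_mutual_information(pairs)

    while remaining:
        best = max(remaining, key=score)
        best_score = score(best)
        if best_score - current < min_gain:
            break  # no remaining variable adds enough information
        mask.append(best)
        remaining.remove(best)
        current = best_score
    return mask


# Toy data: the target copies x1 exactly, while x2 is irrelevant.
samples = [{"x1": a, "x2": b, "target": a} for a in (0, 1) for b in (0, 1)]
print(greedy_mask(samples, ["x1", "x2"], "target"))  # ['x1']
```

In this toy example `x1` carries `log 2` nats of information about the target and `x2` carries none, so the loop stops after one variable. A real mask learner would also need to weigh the planning cost of each added variable, as the regularized objective above does; this sketch omits that term.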

Learning Compact Models for Planning with Exogenous Processes

Minimal Value-Equivalent Partial Models for Scalable and Robust Planning in Lifelong Reinforcement Learning

Learning Abstract World Model for Value-preserving Planning with Options

Minimizing the Negative Side Effects of Planning with Reduced Models

Experiment Planning with Function Approximation

CAMPs: Learning Context-Specific Abstractions for Efficient Planning in Factored MDPs

Learning model-based planning from scratch

Proximity-Based Non-uniform Abstractions for Approximate Planning

Planning with Expectation Models for Control

Planning with a Receding Horizon for Manipulation in Clutter using a Learned Value Function

Near-Optimal Learning and Planning in Separated Latent MDPs

Planning under periodic observations: bounds and bounding-based solutions

Efficient Reinforcement Learning of Task Planners for Robotic Palletization through Iterative Action Masking Learning

COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL

Learning to Imagine Manipulation Goals for Robot Task Planning

INCREMENTAL LEARNING OF PROCEDURAL PLANNING KNOWLEDGE IN CHALLENGING ENVIRONMENTS

Learning Augmented, Multi-Robot Long-Horizon Navigation in Partially Mapped Environments

Learning whom to trust in navigation: dynamically switching between classical and neural planning

Epistemic Exploration for Generalizable Planning and Learning in Non-Stationary Settings

Learning Extrinsic Dexterity with Parameterized Manipulation Primitives

Covert Planning against Imperfect Observers