SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets

Shenghua Wan, Ziyuan Chen, Le Gan, Shuai Feng, De-Chuan Zhan
2024-06-13
Abstract: Model-based offline reinforcement learning (RL) is a promising approach that leverages existing data effectively in many real-world applications, especially those involving high-dimensional inputs like images and videos. To alleviate the distribution shift issue in offline RL, existing model-based methods heavily rely on the uncertainty of learned dynamics. However, the model uncertainty estimation becomes significantly biased when observations contain complex distractors with non-trivial dynamics. To address this challenge, we propose a new approach, Separated Model-based Offline Policy Optimization (SeMOPO), decomposing latent states into endogenous and exogenous parts via conservative sampling and estimating model uncertainty on the endogenous states only. We provide a theoretical guarantee of model uncertainty and performance bound of SeMOPO. To assess the efficacy, we construct the Low-Quality Vision Deep Data-Driven Datasets for RL (LQV-D4RL), where the data are collected by non-expert policy and the observations include moving distractors. Experimental results show that our method substantially outperforms all baseline methods, and further analytical experiments validate the critical designs in our method. The project website is https://sites.google.com/view/semopo.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to learn high-quality models and policies from low-quality offline visual datasets. Specifically, the authors focus on how to train reinforcement learning (RL) agents effectively on data that contains complex distractors (such as moving backgrounds) and was collected by sub-optimal or random policies.

### Problem Background

1. **Offline Reinforcement Learning (Offline RL)**: Offline RL learns policies from a fixed dataset, without expensive interaction with the online environment. This is useful in many practical applications, such as drug discovery and autonomous driving.
2. **Characteristics of Low-Quality Datasets**:
   - Data is usually collected by non-expert or random policies.
   - Observations are high-dimensional, such as images or videos, and contain complex noise, such as moving backgrounds.

### Limitations of Existing Methods

- **Biased Model Uncertainty Estimation**: Existing model-based offline RL methods rely on estimates of model uncertainty to alleviate the distribution shift problem. However, when the observations contain complex distractors, this uncertainty estimate becomes significantly biased.
- **Failure to Distinguish between Relevant and Irrelevant Dynamics**: Previous methods do not distinguish between task-relevant dynamics and task-irrelevant dynamics (such as moving backgrounds). If both are counted as model uncertainty and used as penalty terms in the reward function, the learned policy may become overly conservative.

### Solution

To address these challenges, the authors propose a new method, **Separated Model-based Offline Policy Optimization (SeMOPO)**. Its main contributions are as follows:

1. **Separate Latent States**: Decompose the latent state into endogenous and exogenous states through conservative sampling, and estimate model uncertainty only on the endogenous states.
2. **Construct a Low-Quality Visual Dataset (LQV-D4RL)**: To evaluate the method, the authors construct a new benchmark dataset, LQV-D4RL, in which data is collected by non-expert policies and the observations contain moving distractors.
3. **Theoretical Analysis**: Provide a theoretical guarantee on the lower bound of policy performance in the endogenous state space, and prove the advantage of conservative sampling in distinguishing task-relevant from task-irrelevant information.
4. **Experimental Verification**: Experimental results on LQV-D4RL show that SeMOPO significantly outperforms all baseline methods, and further analytical experiments verify the effectiveness of each component of the method.

### Key Formulas

- **Model Uncertainty Estimation in Endogenous State**:
  \[ \epsilon_{\tilde{u}}(\pi) := \mathbb{E}_{(s^+, a)\sim\rho^\pi}\left[\tilde{u}(s^+, a)\right] \]
  where $s^+$ denotes the endogenous state and $\tilde{u}(s^+, a)$ is an acceptable estimator of model uncertainty.
- **Uncertainty Penalty in Reward Function**:
  \[ \tilde{r}(s^+, a) = r(s^+, a) - \lambda\sum_{i=1}^{K}\left(\mu_i(s^+, a) - \bar{\mu}(s^+, a)\right)^2 \]
  where $\lambda$ is a coefficient that adjusts the penalty weight, and $\bar{\mu}(s^+, a) = \frac{1}{K}\sum_{i=1}^{K}\mu_i(s^+, a)$ is the mean prediction of an ensemble of $K$ endogenous dynamics models.

Through these innovations, SeMOPO handles complex distractors in low-quality visual datasets more accurately, thereby learning higher-quality models and policies.
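
As an illustration of the uncertainty penalty in the reward function above, the following is a minimal NumPy sketch, not the authors' implementation. The function name `penalized_reward`, the array shapes, and the choice to sum squared deviations over the prediction dimensions (when each $\mu_i$ is a vector) are assumptions made for this example.

```python
import numpy as np


def penalized_reward(reward, ensemble_means, lam):
    """Subtract an ensemble-disagreement penalty from the reward.

    reward:         shape (B,), rewards r(s+, a) for a batch of B pairs.
    ensemble_means: shape (K, B, D), prediction mu_i(s+, a) of each of the
                    K endogenous dynamics models (D is an assumed
                    prediction dimension).
    lam:            penalty coefficient lambda.
    """
    # Mean prediction over the K ensemble members: shape (B, D).
    mean_pred = ensemble_means.mean(axis=0)
    # Squared deviation of each member from the mean, summed over the
    # prediction dimension (an assumption for vector outputs): shape (K, B).
    sq_dev = ((ensemble_means - mean_pred) ** 2).sum(axis=-1)
    # Sum over the K members, as in the formula: shape (B,).
    disagreement = sq_dev.sum(axis=0)
    return reward - lam * disagreement


# Toy example with K = 2 models, B = 2 state-action pairs, D = 1.
reward = np.array([5.0, 1.0])
ensemble_means = np.array([
    [[1.0], [2.0]],  # predictions of model 1
    [[3.0], [2.0]],  # predictions of model 2
])
print(penalized_reward(reward, ensemble_means, lam=0.5))  # prints [4. 1.]
```

In the toy example, the two models disagree on the first pair (predictions 1 and 3), so its reward is reduced from 5 to 4. They agree on the second pair, so its reward is unchanged. Because the penalty is computed only from the endogenous models' predictions, disagreement about distractor dynamics does not enter it.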