Abstract: Model-based offline reinforcement learning (RL) is a promising approach that leverages existing data effectively in many real-world applications, especially those involving high-dimensional inputs like images and videos. To alleviate the distribution shift issue in offline RL, existing model-based methods rely heavily on the uncertainty of learned dynamics. However, model uncertainty estimation becomes significantly biased when observations contain complex distractors with non-trivial dynamics. To address this challenge, we propose a new approach, \emph{Separated Model-based Offline Policy Optimization} (SeMOPO), which decomposes latent states into endogenous and exogenous parts via conservative sampling and estimates model uncertainty on the endogenous states only. We provide a theoretical guarantee on the model uncertainty and a performance bound for SeMOPO. To assess its efficacy, we construct the Low-Quality Vision Deep Data-Driven Datasets for RL (LQV-D4RL), where the data are collected by non-expert policies and the observations include moving distractors. Experimental results show that our method substantially outperforms all baseline methods, and further analytical experiments validate the critical designs in our method. The project website is \href{https://sites.google.com/view/semopo}{https://sites.google.com/view/semopo}.

What problem does this paper attempt to address?

This paper addresses the problem of learning high-quality models and policies from low-quality offline visual datasets. Specifically, the authors focus on how to train reinforcement learning (RL) agents effectively on data that contains complex distractors (such as moving backgrounds) and was collected by sub-optimal or random policies.

### Problem Background

1. **Offline Reinforcement Learning (Offline RL)**: Offline RL learns policies from a fixed dataset, without expensive interaction with the online environment. This is useful in many practical applications, such as drug discovery and autonomous driving.
2. **Characteristics of Low-Quality Datasets**:
   - Data is usually collected by non-expert or random policies.
   - Observations are high-dimensional, such as images or videos, and contain complex noise, such as moving backgrounds.

### Limitations of Existing Methods

- **Bias in Model Uncertainty Estimation**: Existing model-based offline RL methods rely on estimates of model uncertainty to alleviate the distribution shift problem. However, when observations contain complex distractors, this uncertainty estimate becomes significantly biased.
- **Failure to Distinguish between Relevant and Irrelevant Dynamics**: Previous methods do not distinguish task-relevant dynamics from task-irrelevant dynamics (such as moving backgrounds). If both are counted as model uncertainty and used as a penalty term in the reward function, the learned policy may become overly conservative.

### Solution

To address these challenges, the authors propose a new method, **Separated Model-based Offline Policy Optimization (SeMOPO)**. The main contributions of SeMOPO are as follows:

1. **Separate Latent States**: Decompose the latent state into endogenous and exogenous states through conservative sampling, and estimate model uncertainty on the endogenous state only.
2. **Construct a Low-Quality Visual Dataset (LQV-D4RL)**: To evaluate the method, the authors construct a new benchmark, LQV-D4RL, in which data is collected by non-expert policies and the observations contain moving distractors.
3. **Theoretical Analysis**: Provide a theoretical guarantee in the form of a lower bound on policy performance in the endogenous state space, and prove the advantage of conservative sampling in separating task-relevant from task-irrelevant information.
4. **Experimental Verification**: Experimental results on LQV-D4RL show that SeMOPO significantly outperforms all baseline methods, and further analytical experiments verify the effectiveness of each component of the method.

### Key Formulas

- **Model Uncertainty Estimation in Endogenous State**:
  \[ \epsilon_{\tilde{u}}(\pi) := \mathbb{E}_{(s^+, a)\sim\rho^\pi}[\tilde{u}(s^+, a)] \]
  where $s^+$ denotes the endogenous state and $\tilde{u}(s^+, a)$ is an acceptable estimator of model uncertainty.
- **Uncertainty Penalty in Reward Function**:
  \[ \tilde{r}(s^+, a) = r(s^+, a) - \lambda\sum_{i=1}^{K}(\mu_i(s^+, a) - \bar{\mu}(s^+, a))^2 \]
  where $\lambda$ is a coefficient that sets the penalty weight, and $\bar{\mu}(s^+, a) = \frac{1}{K}\sum_{i=1}^{K}\mu_i(s^+, a)$ is the average of the predictions of an ensemble of $K$ endogenous dynamics models.

Through these contributions, SeMOPO handles complex distractors in low-quality visual datasets more accurately, and thereby learns higher-quality models and policies.
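To make the uncertainty penalty formula above concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each ensemble prediction $\mu_i(s^+, a)$ is a vector given as a list of floats and reads the squared term as a squared Euclidean norm. The function names (`ensemble_disagreement`, `penalized_reward`, `expected_uncertainty`) and the data layout are hypothetical, chosen only for illustration.

```python
from statistics import fmean


def ensemble_disagreement(preds):
    """Sum over K ensemble members of the squared distance to the ensemble mean.

    preds: list of K predictions, each a list of floats (assumed layout).
    """
    dim = len(preds[0])
    # Per-dimension mean over the K predictions (the \bar{mu} term).
    mean = [fmean(p[d] for p in preds) for d in range(dim)]
    return sum(
        sum((p[d] - mean[d]) ** 2 for d in range(dim))
        for p in preds
    )


def penalized_reward(reward, preds, lam):
    """Reward minus lam times the ensemble disagreement (the r-tilde term)."""
    return reward - lam * ensemble_disagreement(preds)


def expected_uncertainty(samples):
    """Monte Carlo estimate of the expected uncertainty.

    samples: one list of K predictions per sampled (s+, a) pair.
    Here the disagreement itself stands in for the estimator u-tilde,
    which is an assumption of this sketch.
    """
    return fmean(ensemble_disagreement(preds) for preds in samples)


if __name__ == "__main__":
    # Two ensemble members disagree along the first dimension only.
    preds = [[1.0, 0.0], [3.0, 0.0]]
    print(ensemble_disagreement(preds))        # 2.0
    print(penalized_reward(1.0, preds, 0.5))   # 0.0
```

When all ensemble members agree, the disagreement is zero and the reward is left unchanged. Larger disagreement on the endogenous prediction lowers the reward in proportion to $\lambda$.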

SeMOPO: Learning High-quality Model and Policy from Low-quality Offline Visual Datasets
