Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge

Meshal Alharbi, Mardavij Roozbehani, Munther Dahleh

DOI: https://doi.org/10.1609/aaai.v38i10.28953

2024-06-03

Abstract: The problem of sample complexity of online reinforcement learning is often studied in the literature without taking into account any partial knowledge about the system dynamics that could potentially accelerate the learning process. In this paper, we study the sample complexity of online Q-learning methods when some prior knowledge about the dynamics is available or can be learned efficiently. We focus on systems that evolve according to an additive disturbance model of the form $S_{h+1} = f(S_h, A_h) + W_h$, where $f$ represents the underlying system dynamics, and $W_h$ are unknown disturbances independent of states and actions. In the setting of finite episodic Markov decision processes with $S$ states, $A$ actions, and episode length $H$, we present an optimistic Q-learning algorithm that achieves $\tilde{\mathcal{O}}(\text{Poly}(H)\sqrt{T})$ regret under perfect knowledge of $f$, where $T$ is the total number of interactions with the system. This is in contrast to the typical $\tilde{\mathcal{O}}(\text{Poly}(H)\sqrt{SAT})$ regret for existing Q-learning methods. Further, if only a noisy estimate $\hat{f}$ of $f$ is available, our method can learn an approximately optimal policy in a number of samples that is independent of the cardinalities of state and action spaces. The sub-optimality gap depends on the approximation error $\hat{f}-f$, as well as the Lipschitz constant of the corresponding optimal value function. Our approach does not require modeling of the transition probabilities and enjoys the same memory complexity as model-free methods.
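
To make the role of the additive disturbance model concrete, the Bellman optimality equation can be written with the expectation taken only over the disturbance. This is a sketch under assumed notation: the reward $r_h$ and the optimal value functions $Q_h^*$ and $V_h^*$ are not named in the abstract and are introduced here for illustration.

$$Q_h^*(s,a) = r_h(s,a) + \mathbb{E}_{W_h}\left[V_{h+1}^*\big(f(s,a) + W_h\big)\right], \qquad V_h^*(s) = \max_{a} Q_h^*(s,a).$$

Because $W_h$ is independent of states and actions, the distribution inside the expectation is the same for every pair $(s,a)$; only the known shift $f(s,a)$ changes from pair to pair.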

Machine Learning, Optimization and Control

What problem does this paper attempt to address?

The paper asks how partial knowledge of the system dynamics can improve the sample efficiency of reinforcement learning. It focuses on online Q-learning in settings where some information about the dynamics is known in advance or can be learned efficiently, for systems of the additive-disturbance form $S_{h+1} = f(S_h, A_h) + W_h$. The paper proposes an optimistic Q-learning algorithm (UCB-f) that achieves a regret bound of $\tilde{\mathcal{O}}(\text{Poly}(H)\sqrt{T})$ when the dynamics function $f$ is fully known, where $H$ is the episode length and $T$ is the total number of interactions with the system. Existing Q-learning methods typically have regret $\tilde{\mathcal{O}}(\text{Poly}(H)\sqrt{SAT})$, where $S$ and $A$ are the sizes of the state and action spaces, so the new bound removes the $\sqrt{SA}$ factor. When only a noisy estimate $\hat{f}$ is available, the method learns an approximately optimal policy with a number of samples that does not depend on the sizes of the state and action spaces; the suboptimality gap depends on the estimation error $\|\hat{f}-f\|$ and on the Lipschitz constant $L$ of the optimal value function. In short, the paper aims to reduce the sample complexity of reinforcement learning by exploiting dynamics that are structured or partially known, without modeling transition probabilities and with the same memory complexity as model-free methods.
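
The intuition behind the missing $\sqrt{SA}$ factor can be illustrated with a small sketch. It is not the paper's UCB-f algorithm; it only shows one way a known $f$ could be exploited: each observed transition reveals the disturbance $W_h = S_{h+1} - f(S_h, A_h)$, and because the disturbance is independent of states and actions, that single sample can update the Q-value of every state-action pair. Everything specific below is an assumption made for illustration: the ring-shaped state space with modular addition, the known deterministic rewards, the disturbance distribution, the learning rate $(H+1)/(H+k)$, and the bonus constant `c`.

```python
import numpy as np


def run_sketch(S=5, A=2, H=3, episodes=200, c=0.1, seed=0):
    """Optimistic Q-learning sketch that reuses each disturbance sample for all (s, a)."""
    rng = np.random.default_rng(seed)

    # Hypothetical known dynamics on a ring of S states: f(s, a) = (s + a + 1) mod S.
    states = np.arange(S)[:, None]
    actions = np.arange(A)[None, :]
    f_table = (states + actions + 1) % S          # shape (S, A)

    # Hypothetical known deterministic rewards in [0, 1].
    r = rng.uniform(size=(S, A))

    # Hypothetical disturbance distribution over shifts {0, 1, 2}; hidden from the learner.
    w_probs = np.array([0.7, 0.2, 0.1])

    Q = np.full((H, S, A), float(H))              # optimistic initialization
    V = np.zeros((H + 1, S))                      # V[H] stays 0 (end of episode)
    V[:H] = H

    for k in range(1, episodes + 1):
        alpha = (H + 1) / (H + k)                 # assumed learning-rate schedule
        bonus = c * np.sqrt(H**3 * np.log(2 * H * episodes) / k)  # assumed bonus form
        s = int(rng.integers(S))
        for h in range(H):
            a = int(np.argmax(Q[h, s]))           # act greedily w.r.t. optimistic Q
            w_true = int(rng.choice(len(w_probs), p=w_probs))
            s_next = int((f_table[s, a] + w_true) % S)

            # Recover the disturbance from the observed transition and the known f.
            w = (s_next - f_table[s, a]) % S

            # Use the same disturbance sample to update every (s, a) pair at step h.
            target = r + V[h + 1][(f_table + w) % S]
            Q[h] = (1 - alpha) * Q[h] + alpha * (target + bonus)
            V[h] = np.minimum(H, Q[h].max(axis=1))
            s = s_next
    return Q, V


if __name__ == "__main__":
    Q, V = run_sketch()
    print("Greedy first-step action per state:", Q[0].argmax(axis=1))
```

Note that the update count at step `h` is the episode index `k` for all pairs at once, rather than a separate visit count per pair, which is the informal reason a bound of this kind need not scale with the number of states and actions.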

Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning

Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction

Sample Complexity of Episodic Fixed-Horizon Reinforcement Learning

Settling the Sample Complexity of Model-Based Offline Reinforcement Learning

Posterior Sampling-based Online Learning for Episodic POMDPs

Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency

Actions Speak What You Want: Provably Sample-Efficient Reinforcement Learning of the Quantal Stackelberg Equilibrium from Strategic Feedbacks

Sample Complexity of Variance-reduced Distributionally Robust Q-learning

Sample-Efficient Reinforcement Learning Is Feasible for Linearly Realizable MDPs with Limited Revisiting

Provably Efficient Reinforcement Learning in Partially Observable Dynamical Systems

Settling the Horizon-Dependence of Sample Complexity in Reinforcement Learning

Efficient Reinforcement Learning with Impaired Observability: Learning to Act with Delayed and Missing State Observations

Hybrid Reinforcement Learning Breaks Sample Size Barriers in Linear MDPs

Sample Efficient Reinforcement Learning Method Via High Efficient Episodic Memory

Sample-efficient Safe Learning for Online Nonlinear Control with Control Barrier Functions

Sample-Efficient Reinforcement Learning with Temporal Logic Objectives: Leveraging the Task Specification to Guide Exploration

Sample-Efficient Learning of POMDPs with Multiple Observations In Hindsight

Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model

Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP