Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge

Meshal Alharbi, Mardavij Roozbehani, Munther Dahleh
DOI: https://doi.org/10.1609/aaai.v38i10.28953
2024-06-03
Abstract: The problem of sample complexity of online reinforcement learning is often studied in the literature without taking into account any partial knowledge about the system dynamics that could potentially accelerate the learning process. In this paper, we study the sample complexity of online Q-learning methods when some prior knowledge about the dynamics is available or can be learned efficiently. We focus on systems that evolve according to an additive disturbance model of the form $S_{h+1} = f(S_h, A_h) + W_h$, where $f$ represents the underlying system dynamics, and $W_h$ are unknown disturbances independent of states and actions. In the setting of finite episodic Markov decision processes with $S$ states, $A$ actions, and episode length $H$, we present an optimistic Q-learning algorithm that achieves $\tilde{\mathcal{O}}(\text{Poly}(H)\sqrt{T})$ regret under perfect knowledge of $f$, where $T$ is the total number of interactions with the system. This is in contrast to the typical $\tilde{\mathcal{O}}(\text{Poly}(H)\sqrt{SAT})$ regret for existing Q-learning methods. Further, if only a noisy estimate $\hat{f}$ of $f$ is available, our method can learn an approximately optimal policy in a number of samples that is independent of the cardinalities of state and action spaces. The sub-optimality gap depends on the approximation error $\hat{f}-f$, as well as the Lipschitz constant of the corresponding optimal value function. Our approach does not require modeling of the transition probabilities and enjoys the same memory complexity as model-free methods.
Machine Learning, Optimization and Control
What problem does this paper attempt to address?
The paper asks how partial knowledge of a system's dynamics can improve the sample efficiency of reinforcement learning. It focuses on online Q-learning in settings where the dynamics function \(f\) of the additive disturbance model \(S_{h+1} = f(S_h, A_h) + W_h\) is known in advance or can be learned efficiently, while the disturbances \(W_h\) remain unknown. The paper proposes an optimistic Q-learning algorithm (UCB-f) that achieves a regret bound of \(\tilde{O}(\text{Poly}(H)\sqrt{T})\) when \(f\) is fully known, where \(H\) is the episode length and \(T\) is the total number of interactions with the system. Existing Q-learning methods typically achieve \(\tilde{O}(\text{Poly}(H)\sqrt{SAT})\), where \(S\) and \(A\) are the sizes of the state space and action space, so the new bound removes the dependence on \(S\) and \(A\). When only a noisy estimate \(\hat{f}\) is available, the method can still learn an approximately optimal policy with a number of samples that does not depend on the sizes of the state and action spaces. In that case the suboptimality gap depends on the estimation error \(\|\hat{f}-f\|\) and on the Lipschitz constant \(L\) of the corresponding optimal value function. In summary, the paper aims to reduce sample complexity by exploiting known or partially known structure in the dynamics, without modeling the transition probabilities and with the same memory complexity as model-free methods.
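The abstract does not spell out why knowing \(f\) removes the dependence on \(S\) and \(A\). One plausible reading is that, because \(W_h\) is independent of states and actions, a single observed transition reveals the disturbance, and that disturbance can then be replayed through \(f\) for every state-action pair. The Python sketch below illustrates this idea on a toy problem. It is not the paper's UCB-f algorithm. The toy dynamics `f`, the disturbance distribution, the modular arithmetic used to keep states in a finite set, and all function names are assumptions made for illustration.

```python
import random

NUM_STATES = 5   # S, a hypothetical toy size
NUM_ACTIONS = 2  # A, a hypothetical toy size


def f(state, action):
    """Known dynamics (hypothetical toy choice): shift the state by action + 1."""
    return (state + action + 1) % NUM_STATES


def step(state, action, rng):
    """Environment step S_{h+1} = f(S_h, A_h) + W_h, taken modulo S (an assumption).

    The disturbance is drawn independently of the state and action,
    and the learner only observes the resulting next state.
    """
    disturbance = rng.choice([0, 0, 1, 2])
    return (f(state, action) + disturbance) % NUM_STATES


def recover_disturbance(state, action, next_state):
    """Invert the additive model: W_h = S_{h+1} - f(S_h, A_h) (mod S)."""
    return (next_state - f(state, action)) % NUM_STATES


def synthesize_transitions(disturbance):
    """Replay one recovered disturbance through f for every (state, action) pair."""
    return {
        (s, a): (f(s, a) + disturbance) % NUM_STATES
        for s in range(NUM_STATES)
        for a in range(NUM_ACTIONS)
    }


rng = random.Random(0)
state, action = 3, 1
next_state = step(state, action, rng)

w = recover_disturbance(state, action, next_state)
transitions = synthesize_transitions(w)

# One real interaction yields a next-state sample for all S * A pairs.
print(f"recovered disturbance: {w}")
print(f"synthetic samples from one interaction: {len(transitions)}")
```

Under these assumptions, each real interaction provides a next-state sample for all \(S \cdot A\) pairs rather than only the visited one, which gives some intuition for a regret bound that scales with \(\sqrt{T}\) instead of \(\sqrt{SAT}\).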