Abstract: In standard RL, a learner attempts to learn an optimal policy for a Markov Decision Process (MDP) whose structure (e.g. state space) is known. In online model selection, a learner attempts to learn an optimal policy for an MDP knowing only that it belongs to one of $M > 1$ model classes of varying complexity. Recent results have shown that this can be feasibly accomplished in episodic online RL. In this work, we propose $\mathsf{MRBEAR}$, an online model selection algorithm for the average reward RL setting. The regret of the algorithm is $\tilde O(M C_{m^*}^2 \mathsf{B}_{m^*}(T,\delta))$, where $C_{m^*}$ represents the complexity of the simplest well-specified model class and $\mathsf{B}_{m^*}(T,\delta)$ is its corresponding regret bound. This result shows that in average reward RL, as in episodic online RL, the additional cost of model selection scales only linearly in $M$, the number of model classes. We apply $\mathsf{MRBEAR}$ to the interaction between a learner and an opponent in a two-player simultaneous general-sum repeated game, where the opponent follows a fixed, unknown, limited-memory strategy. The learner's goal is to maximize its utility without knowing the opponent's utility function. The interaction runs over $T$ rounds with no episodes or discounting, which leads us to measure the learner's performance by average reward regret. In this application, our algorithm enjoys an opponent-complexity-dependent regret of $\tilde O(M(\mathsf{sp}(h^*) B^{m^*} A^{m^*+1})^{\frac{3}{2}} \sqrt{T})$, where $m^*\le M$ is the unknown memory limit of the opponent, $\mathsf{sp}(h^*)$ is the unknown span of the optimal bias induced by the opponent, and $A$ and $B$ are the numbers of actions for the learner and the opponent, respectively. We also show that the exponential dependency on $m^*$ is unavoidable by proving a lower bound on the learner's regret.
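To get a feel for how the repeated-game bound grows with the opponent's memory, the following is a minimal sketch that evaluates only the leading term $M(\mathsf{sp}(h^*) B^{m^*} A^{m^*+1})^{3/2}\sqrt{T}$ from the abstract. The function name `regret_bound_scale` and the parameter values are illustrative assumptions, not part of the paper, and the constants and logarithmic factors hidden by $\tilde O$ are dropped, so the numbers indicate scaling only.

```python
import math


def regret_bound_scale(M, span, A, B, m_star, T):
    """Leading term M * (sp(h*) * B^m* * A^(m*+1))^(3/2) * sqrt(T).

    Hypothetical helper: constants and log factors hidden by the
    O-tilde notation are ignored, so only relative growth is meaningful.
    """
    if m_star > M:
        raise ValueError("the abstract assumes m_star <= M")
    # Opponent-dependent complexity term appearing inside the bound.
    complexity = span * (B ** m_star) * (A ** (m_star + 1))
    return M * complexity ** 1.5 * math.sqrt(T)


if __name__ == "__main__":
    # Illustrative values only: 2 actions per player, unit span, T = 10,000.
    for m_star in (1, 2, 3):
        value = regret_bound_scale(M=3, span=1.0, A=2, B=2, m_star=m_star, T=10_000)
        print(f"m* = {m_star}: leading term = {value:.1f}")
```

Under these assumptions, raising $m^*$ by one multiplies the leading term by $(AB)^{3/2}$, which is the exponential dependence on $m^*$ that the abstract says cannot be avoided.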

Model Selection for Average Reward RL with Application to Utility Maximization in Repeated Games
