Abstract: We study episodic linear mixture MDPs with unknown transition and adversarial rewards under full-information feedback, using dynamic regret as the performance measure. We begin with an in-depth analysis of the strengths and limitations of the two most popular approaches: occupancy-measure-based and policy-based methods. We observe that the occupancy-measure-based method handles non-stationary environments effectively but has difficulty with the unknown transition. In contrast, the policy-based method deals with the unknown transition effectively but struggles in non-stationary environments. Building on this observation, we propose a novel algorithm that combines the benefits of both methods. Specifically, it employs (i) an occupancy-measure-based global optimization with a two-layer structure to handle non-stationary environments; and (ii) a policy-based variance-aware value-targeted regression to handle the unknown transition. We bridge these two parts with a novel conversion. Our algorithm enjoys an $\widetilde{\mathcal{O}}(d \sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension, $H$ is the episode length, $K$ is the number of episodes, and $\bar{P}_K$ is the non-stationarity measure. We show that this bound is minimax optimal up to logarithmic factors by establishing a matching lower bound. To the best of our knowledge, this is the first work that achieves near-optimal dynamic regret for adversarial linear mixture MDPs with unknown transition without prior knowledge of the non-stationarity measure.
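To get a feel for how the two terms of the stated bound trade off, the following minimal sketch evaluates $d \sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)}$ numerically. It is only an order-level illustration: it drops the constants and logarithmic factors hidden by $\widetilde{\mathcal{O}}$, and the function name `dynamic_regret_bound` and the sample values of `d`, `H`, `K`, and `P_bar` are hypothetical, not taken from the paper.

```python
import math


def dynamic_regret_bound(d, H, K, P_bar):
    """Order-level value of d*sqrt(H^3 K) + sqrt(H K (H + P_bar)).

    Constants and logarithmic factors hidden by the O-tilde notation
    are ignored, so only relative comparisons are meaningful.
    """
    # Term attributed to learning the unknown transition.
    transition_term = d * math.sqrt(H**3 * K)
    # Term that grows with the non-stationarity measure P_bar.
    nonstationarity_term = math.sqrt(H * K * (H + P_bar))
    return transition_term + nonstationarity_term


# Hypothetical values: d = 2, H = 4, K = 100.
static_case = dynamic_regret_bound(2, 4, 100, 0)     # P_bar = 0
drifting_case = dynamic_regret_bound(2, 4, 100, 12)  # P_bar = 12
print(static_case, drifting_case)
```

With $\bar{P}_K = 0$ the second term reduces to $H\sqrt{K}$, so the expression is dominated by the $d\sqrt{H^3 K}$ term; as $\bar{P}_K$ grows, the second term contributes $\sqrt{HK\bar{P}_K}$.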

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Provably Efficient Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs

Regret Minimization For Reinforcement Learning By Evaluating The Optimal Bias Function

Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation

Optimal Sample Complexity for Average Reward Markov Decision Processes

Efficient Exploration in Average-Reward Constrained Reinforcement Learning: Achieving Near-Optimal Regret With Posterior Sampling

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Learning Infinite-Horizon Average-Reward Linear Mixture MDPs of Bounded Span

$\sqrt{n}$-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank

Robust Average-Reward Markov Decision Processes

Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

Online Reinforcement Learning in Markov Decision Process Using Linear Programming

Logarithmic Regret Bounds for Continuous-Time Average-Reward Markov Decision Processes

Learning General Parameterized Policies for Infinite Horizon Average Reward Constrained MDPs via Primal-Dual Policy Gradient Algorithm

Improved Regret Bounds for Linear Adversarial MDPs via Linear Optimization

Optimistic Q-learning for average reward and episodic reinforcement learning

Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs

Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs

A Provably Efficient Algorithm for Linear Markov Decision Process with Low Switching Cost