Abstract: We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback, employing dynamic regret as the performance measure. We start with in-depth analyses of the strengths and limitations of the two most popular methods: occupancy-measure-based and policy-based methods. We observe that while the occupancy-measure-based method is effective in addressing non-stationary environments, it encounters difficulties with the unknown transition. In contrast, the policy-based method can deal with the unknown transition effectively but faces challenges in handling non-stationary environments. Building on this, we propose a novel algorithm that combines the benefits of both methods. Specifically, it employs (i) an occupancy-measure-based global optimization with a two-layer structure to handle non-stationary environments; and (ii) a policy-based variance-aware value-targeted regression to tackle the unknown transition. We bridge these two parts by a novel conversion. Our algorithm enjoys an $\widetilde{\mathcal{O}}(d \sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension, $H$ is the episode length, $K$ is the number of episodes, and $\bar{P}_K$ is the non-stationarity measure. We show it is minimax optimal up to logarithmic factors by establishing a matching lower bound. To the best of our knowledge, this is the first work that achieves near-optimal dynamic regret for adversarial linear mixture MDPs with the unknown transition without prior knowledge of the non-stationarity measure.

What problem does this paper attempt to address?

The paper asks how to achieve near-optimal dynamic regret in a linear mixture Markov decision process (MDP) with unknown transition probabilities and adversarial rewards, under full-information feedback. Specifically, the authors focus on:

1. **Non-stationary environment**: The reward function changes over time, and the change may be adversarial.
2. **Unknown transition probabilities**: The transition kernel is unknown and must be estimated through learning.
3. **Dynamic regret as a performance measure**: Compared with static regret, dynamic regret better reflects how well the algorithm adapts to a non-stationary environment.

To address these challenges, the authors analyze the strengths and weaknesses of two common methods, the occupancy-measure-based method and the policy-based method, and propose a new algorithm that combines the advantages of both:

- **Occupancy-measure-based method**: It handles non-stationary environments well, but encounters difficulties when the transition probabilities are unknown.
- **Policy-based method**: It deals effectively with unknown transition probabilities, but faces challenges in non-stationary environments.

### Main contributions

The proposed algorithm is called Occupancy-measure-based Optimization with Policy-based Estimation (OOPE). It consists of two main parts:

1. **Occupancy-measure-based global optimization**: Uses a two-layer structure to handle the non-stationarity of the environment.
2. **Policy-based variance-aware value-targeted regression**: Used to handle the unknown transition probabilities.

The two parts are bridged by a novel conversion. With this combination, the authors prove that the algorithm achieves a near-optimal dynamic regret of

\[ \widetilde{\mathcal{O}}\left(d\sqrt{H^3K} + \sqrt{HK(H + \bar{P}_K)}\right) \]

where:

- $d$ is the feature dimension,
- $H$ is the length of each episode,
- $K$ is the number of episodes,
- $\bar{P}_K$ is the non-stationarity measure.

The authors also prove a matching lower bound, so this result is minimax optimal up to logarithmic factors. The algorithm does not require prior knowledge of the non-stationarity measure $\bar{P}_K$, which makes it applicable when the degree of non-stationarity cannot be known in advance.
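To get a feel for how the two terms of the bound trade off, the following is a minimal sketch that evaluates them numerically. It is not part of the paper: the function name `dynamic_regret_bound` and all parameter values are hypothetical, and constants and logarithmic factors hidden by the $\widetilde{\mathcal{O}}$ notation are ignored, so the numbers indicate scaling only.

```python
import math


def dynamic_regret_bound(d, H, K, P_bar):
    """Return the two terms of d*sqrt(H^3 K) + sqrt(H K (H + P_bar)).

    Constants and logarithmic factors are dropped (an assumption of this
    sketch), so the values show how the bound scales, not its exact size.
    """
    # First term: depends on the feature dimension d, not on P_bar.
    dimension_term = d * math.sqrt(H**3 * K)
    # Second term: grows with the non-stationarity measure P_bar.
    nonstationarity_term = math.sqrt(H * K * (H + P_bar))
    return dimension_term, nonstationarity_term


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    d, H, K = 10, 20, 10_000
    for P_bar in (0, 1_000, 100_000):
        dim_term, ns_term = dynamic_regret_bound(d, H, K, P_bar)
        total = dim_term + ns_term
        print(f"P_bar={P_bar:>7}: dimension term={dim_term:,.0f}, "
              f"non-stationarity term={ns_term:,.0f}, total={total:,.0f}")
```

Squaring both terms shows that the second term exceeds the first only when $H + \bar{P}_K > d^2 H^2$. When $\bar{P}_K = 0$ the second term reduces to $H\sqrt{K}$.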

Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs

Dynamic Regret of Adversarial Linear Mixture MDPs

Dynamic Regret of Adversarial MDPs with Unknown Transition and Linear Function Approximation

Improved Algorithm for Adversarial Linear Mixture MDPs with Bandit Feedback and Unknown Transition

Dynamic Regret of Online Markov Decision Processes

Refined Regret for Adversarial MDPs with Linear Function Approximation

Towards Optimal Regret in Adversarial Linear MDPs with Bandit Feedback

Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path

Learning Adversarial Low-rank Markov Decision Processes with Unknown Transition and Full-information Feedback

Learning Infinite-Horizon Average-Reward Linear Mixture MDPs of Bounded Span

Dynamic Regret of Policy Optimization in Non-stationary Environments

Beating Adversarial Low-Rank MDPs with Unknown Transition and Bandit Feedback

Learning Adversarial MDPs with Bandit Feedback and Unknown Transition

Simultaneously Learning Stochastic and Adversarial Episodic MDPs with Known Transition

Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation

Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Online Reinforcement Learning in Markov Decision Process Using Linear Programming

Optimistic Regret Bounds for Online Learning in Adversarial Markov Decision Processes