Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs

Long-Fei Li, Peng Zhao, Zhi-Hua Zhou
2024-11-05
Abstract: We study episodic linear mixture MDPs with the unknown transition and adversarial rewards under full-information feedback, employing dynamic regret as the performance measure. We start with in-depth analyses of the strengths and limitations of the two most popular methods: occupancy-measure-based and policy-based methods. We observe that while the occupancy-measure-based method is effective in addressing non-stationary environments, it encounters difficulties with the unknown transition. In contrast, the policy-based method can deal with the unknown transition effectively but faces challenges in handling non-stationary environments. Building on this, we propose a novel algorithm that combines the benefits of both methods. Specifically, it employs (i) an occupancy-measure-based global optimization with a two-layer structure to handle non-stationary environments; and (ii) a policy-based variance-aware value-targeted regression to tackle the unknown transition. We bridge these two parts by a novel conversion. Our algorithm enjoys an $\widetilde{\mathcal{O}}(d \sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension, $H$ is the episode length, $K$ is the number of episodes, and $\bar{P}_K$ is the non-stationarity measure. We show it is minimax optimal up to logarithmic factors by establishing a matching lower bound. To the best of our knowledge, this is the first work that achieves near-optimal dynamic regret for adversarial linear mixture MDPs with the unknown transition without prior knowledge of the non-stationarity measure.
Machine Learning
What problem does this paper attempt to address?
The paper asks how to achieve near-optimal dynamic regret in episodic linear mixture Markov decision processes (MDPs) with unknown transition probabilities and adversarial rewards under full-information feedback. Specifically, the authors focus on three aspects:

1. **Non-stationary environment**: The reward function changes over time, and the changes may be adversarial.
2. **Unknown transition probabilities**: The transition probabilities are unknown and must be estimated through learning.
3. **Dynamic regret as the performance measure**: Unlike static regret, which compares the learner against a single fixed policy, dynamic regret compares it against a sequence of changing policies, so it better reflects how well the algorithm adapts to a non-stationary environment.

To address these challenges, the authors analyze the strengths and limitations of the two most common methods and propose a new algorithm that combines the advantages of both:

- **Occupancy-measure-based method**: It handles non-stationary environments well, but encounters difficulties when the transition probabilities are unknown.
- **Policy-based method**: It deals with unknown transition probabilities effectively, but faces challenges in non-stationary environments.

### Main contributions

The proposed algorithm is called Occupancy-measure-based Optimization with Policy-based Estimation (OOPE). It consists of two main parts, bridged by a novel conversion:

1. **Occupancy-measure-based global optimization**: Uses a two-layer structure to handle the non-stationarity of the environment.
2. **Policy-based variance-aware value-targeted regression**: Used to handle the unknown transition probabilities.

By combining these two parts, the authors prove that the algorithm achieves a near-optimal dynamic regret of

\[ \widetilde{\mathcal{O}}\left(d\sqrt{H^3K} + \sqrt{HK(H + \bar{P}_K)}\right) \]

where:

- \(d\) is the feature dimension,
- \(H\) is the length of each episode,
- \(K\) is the number of episodes,
- \(\bar{P}_K\) is the non-stationarity measure.

The authors also establish a matching lower bound, which shows that this result is minimax optimal up to logarithmic factors. The algorithm does not require prior knowledge of the non-stationarity measure \(\bar{P}_K\), which makes it more practical, since this quantity is generally not known in advance.
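To get a feel for how the two terms of this bound trade off, the following is a minimal Python sketch that evaluates them numerically. It is an illustration only and not part of the paper: it drops all constants and logarithmic factors hidden by the \(\widetilde{\mathcal{O}}\) notation, treats \(\bar{P}_K\) as a plain input number, and the function names and the labels "estimation term" and "non-stationarity term" are hypothetical shorthand chosen here.

```python
import math


def regret_bound_terms(d, H, K, P_bar):
    """Evaluate the two terms of d*sqrt(H^3 K) + sqrt(H K (H + P_bar)).

    Constants and logarithmic factors are ignored, so the values are only
    meaningful relative to each other, not as actual regret numbers.
    """
    # First term: scales with the feature dimension d (label is an assumption).
    estimation_term = d * math.sqrt(H**3 * K)
    # Second term: the only place where the non-stationarity measure enters.
    nonstationarity_term = math.sqrt(H * K * (H + P_bar))
    return estimation_term, nonstationarity_term


def crossover_path_length(d, H):
    """Value of P_bar at which the two terms are equal.

    Setting d*sqrt(H^3 K) = sqrt(H K (H + P_bar)) and squaring both sides
    gives d^2 H^2 = H + P_bar, so the crossover does not depend on K.
    """
    return d**2 * H**2 - H


if __name__ == "__main__":
    d, H, K = 10, 20, 10_000  # illustrative values, not from the paper
    for P_bar in (0.0, 1_000.0, float(crossover_path_length(d, H)), 200_000.0):
        est, non = regret_bound_terms(d, H, K, P_bar)
        larger = "first" if est >= non else "second"
        print(f"P_bar={P_bar:>10.0f}  first={est:.1f}  second={non:.1f}  larger={larger}")
```

Under these simplifications, the second term overtakes the first only once \(\bar{P}_K\) exceeds roughly \(d^2H^2 - H\), regardless of the number of episodes \(K\). For a stationary comparator (\(\bar{P}_K = 0\)), the second term reduces to \(H\sqrt{K}\).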