Single-Trajectory Distributionally Robust Reinforcement Learning

Zhipeng Liang, Xiaoteng Ma, Jose Blanchet, Jiheng Zhang, Zhengyuan Zhou
2024-09-21
Abstract: To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As a price for robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design a first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We delicately design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without modelling the environment, thus the algorithm can be trained along a single trajectory in a model-free fashion. Despite the algorithm's complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools. Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm, compared to non-robust methods and other robust RL algorithms.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
### The problems the paper attempts to solve

This paper addresses a key problem in reinforcement learning (RL): **how to improve the robustness and generalization of an algorithm when the training environment differs from the test environment**. Traditional RL methods usually assume that the training and test environments are identical. In practice, the test environment may be more complex than, or simply different from, the training environment, especially in fields such as financial markets and robot control. This mismatch can significantly degrade the performance of a policy that is optimal in the training environment.

To address this, the paper builds on Distributionally Robust Reinforcement Learning (DRRL). DRRL improves performance in unknown test environments by optimizing the worst-case expected return over an ambiguity set that contains all possible test distributions. However, existing DRRL algorithms are either model-based or unable to learn from a single sample trajectory, which limits their practical use.

The main contribution of the paper is therefore a new, fully model-free DRRL algorithm, called Distributionally Robust Q-learning with Single Trajectory (DRQ). The algorithm learns the optimal distributionally robust policy directly from a single sample trajectory, without modeling the environment, and comes with an asymptotic convergence guarantee. Experimental results show that DRQ outperforms non-robust methods and other robust RL algorithms in terms of robustness and sample complexity.

### Key technical points

1. **Construction of the ambiguity set**:
   - Use the Cressie-Read family of f-divergences to construct the ambiguity set, covering the common KL divergence and χ² divergence.
   - Reformulate the DRRL problem through its strong dual form so that it can handle samples from an unspecified MDP.
2. **Multi-timescale stochastic approximation scheme**:
   - Develop a new multi-timescale stochastic approximation scheme to deal with the additional nonlinearity in the DR Bellman equation.
   - Update the Q-table on the slowest timescale; the other two timescales are designed to reduce the bias of the plug-in estimator.
3. **Algorithm design**:
   - Instantiate the framework as the DRQ algorithm to solve the fully online, incremental learning problem for the discounted Markov decision process (MDP).
   - Prove the asymptotic convergence of the algorithm, extending the classical two-timescale stochastic approximation framework.
4. **Experimental verification**:
   - Demonstrate the robustness and sample efficiency of the algorithm in environments such as Cliffwalking and American put option.
   - Develop a deep-learning version of the DRQ algorithm and compare it on classical control tasks such as LunarLander and CartPole.

### Conclusion

The paper proposes the DRQ algorithm, which learns from a single sample trajectory and thereby offers better robustness and generalization in unknown test environments. Experimental results show that DRQ performs well in terms of robustness and sample complexity, which supports its use in practical reinforcement learning applications.
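
To make the "strong dual form" listed under the key technical points more concrete, here is a minimal sketch for the KL-divergence case only. It relies on the standard duality result that the worst-case expectation of a value `V` over a KL ball of radius `rho` around a nominal distribution `P0` equals the supremum over `beta > 0` of `-beta * log E_P0[exp(-V / beta)] - beta * rho`. The function names, the search bounds, and the golden-section search are assumptions made for illustration. This is not the paper's DRQ update, which works with the Cressie-Read family and estimates these quantities online from a single trajectory.

```python
import math


def kl_dual_objective(values, probs, beta, rho):
    """Dual objective g(beta) = -beta * log E_P0[exp(-V / beta)] - beta * rho.

    Computed with a log-sum-exp shift so small beta does not overflow.
    """
    # Shift by the smallest value that has positive nominal probability.
    m = min(v for v, p in zip(values, probs) if p > 0)
    s = sum(p * math.exp(-(v - m) / beta) for v, p in zip(values, probs) if p > 0)
    return m - beta * math.log(s) - beta * rho


def kl_robust_expectation(values, probs, rho,
                          beta_lo=1e-6, beta_hi=1e3, iters=200):
    """Approximate the worst-case expectation over a KL ball of radius rho.

    The dual objective is concave in beta, so a golden-section search
    over [beta_lo, beta_hi] (illustrative bounds) finds its maximum.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = beta_lo, beta_hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    for _ in range(iters):
        fc = kl_dual_objective(values, probs, c, rho)
        fd = kl_dual_objective(values, probs, d, rho)
        if fc < fd:
            a = c  # the maximizer lies in [c, b]
        else:
            b = d  # the maximizer lies in [a, d]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return kl_dual_objective(values, probs, (a + b) / 2, rho)


if __name__ == "__main__":
    next_state_values = [1.0, 2.0, 3.0]   # hypothetical V(s') for three states
    nominal = [1 / 3, 1 / 3, 1 / 3]       # hypothetical nominal transition row
    for radius in (0.0, 0.1, 0.5):
        print(radius, kl_robust_expectation(next_state_values, nominal, radius))
```

As the radius grows, the robust value moves from the nominal mean toward the smallest reachable value. A robust Bellman backup would use this quantity in place of the ordinary expectation over next states. The point of the dual form is that it turns an optimization over distributions into a one-dimensional problem that can be estimated from samples.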