Abstract: To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As a price for the robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design the first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We carefully design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without modelling the environment, so that the algorithm can be trained along a single trajectory in a model-free fashion. Despite the algorithm's complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools. Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm, compared to non-robust methods and other robust RL algorithms.

What problem does this paper attempt to address?

### The problems the paper attempts to solve

This paper addresses a key problem in Reinforcement Learning (RL): **how to improve the robustness and generalization ability of an algorithm when the training environment differs from the test environment**. Traditional RL methods usually assume that the training and test environments are the same. In practice, the test environment may be more complex than, or simply different from, the training environment, especially in fields such as financial markets and robot control. This mismatch can significantly degrade the performance of a policy that is optimal in the training environment.

To address this, the paper builds on Distributionally Robust Reinforcement Learning (DRRL). DRRL improves performance in unknown test environments by optimizing the worst-case expected return within an ambiguity set that contains all possible test distributions. However, existing DRRL algorithms are either model-based or unable to learn from a single sample trajectory, which limits their practical use.

The main contribution of the paper is therefore a new model-free DRRL algorithm, called Distributionally Robust Q-learning with Single Trajectory (DRQ). The algorithm learns the optimal distributionally robust policy directly from a single sample trajectory, without modeling the environment, and comes with an asymptotic convergence guarantee. Experimental results show that DRQ outperforms non-robust methods and other robust RL algorithms in terms of robustness and sample complexity.

### Key technical points

1. **Construction of the ambiguity set**:
   - Use the Cressie-Read f-divergence family to construct the ambiguity set, covering the common KL divergence and χ² divergence.
   - Reformulate the DRRL problem through its strong dual form so that it can handle unspecified MDP samples.
2. **Multi-timescale stochastic approximation scheme**:
   - Develop a new multi-timescale stochastic approximation scheme to deal with the additional nonlinearity in the DR Bellman equation.
   - The Q-table is updated on the slowest timescale; the other two timescales are designed to reduce the bias of the plug-in estimator.
3. **Algorithm design**:
   - Instantiate the framework as the DRQ algorithm to solve the fully online, incremental learning problem for the discounted Markov decision process (MDP).
   - Prove the asymptotic convergence of the algorithm, extending the classical two-timescale stochastic approximation framework.
4. **Experimental verification**:
   - Demonstrate the robustness and sample efficiency of the algorithm in environments such as Cliffwalking and American put option.
   - Develop a deep-learning version of the DRQ algorithm and compare it on classical control tasks such as LunarLander and CartPole.

### Conclusion

The paper proposes the DRQ algorithm, which learns from a single sample trajectory and thereby provides better robustness and generalization in unknown test environments. Experimental results show that DRQ performs well in terms of robustness and sample complexity, which supports the use of reinforcement learning in practical applications.
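As an illustration of the worst-case expected return and the strong dual form mentioned under the key technical points, the following is a minimal sketch for the KL-divergence case only. It relies on the standard duality result that the infimum of E_P[V] over {P : KL(P ‖ P₀) ≤ ρ} equals the supremum over β > 0 of −β log E_{P₀}[exp(−V/β)] − βρ. The function name `kl_robust_expectation`, the grid search over β, and the example numbers are assumptions made for illustration; this is not the DRQ algorithm, which estimates the dual quantities incrementally from a single trajectory rather than from a known nominal distribution.

```python
import numpy as np


def kl_robust_expectation(values, probs, rho, betas=None):
    """Approximate inf_{P : KL(P || probs) <= rho} E_P[values] via the KL dual.

    Hypothetical helper for illustration: the dual is maximized by a simple
    grid search over the dual variable beta, so the result is a lower bound
    on the exact worst-case expectation.
    """
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    if betas is None:
        betas = np.logspace(-3, 3, 2001)  # assumed search grid for beta > 0

    # Restrict to the support of the nominal distribution.
    support = probs > 0
    v, p = values[support], probs[support]

    # log E_p[exp(-v / beta)] for every beta, computed with a stable log-sum-exp.
    z = -v[None, :] / betas[:, None]
    z_max = z.max(axis=1, keepdims=True)
    log_mgf = z_max[:, 0] + np.log((p[None, :] * np.exp(z - z_max)).sum(axis=1))

    # Dual objective: -beta * log E_p[exp(-v / beta)] - beta * rho.
    dual = -betas * log_mgf - betas * rho
    return float(dual.max())


# Example: next-state values under a nominal transition distribution.
next_state_values = [0.0, 1.0, 2.0]
nominal_probs = [0.2, 0.5, 0.3]
nominal_value = float(np.dot(nominal_probs, next_state_values))
robust_small = kl_robust_expectation(next_state_values, nominal_probs, rho=0.1)
robust_large = kl_robust_expectation(next_state_values, nominal_probs, rho=0.5)
```

In this sketch a larger radius `rho` allows the adversarial distribution to move further from the nominal one, so the robust value can only decrease as `rho` grows, and it never exceeds the nominal expectation.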

Single-Trajectory Distributionally Robust Reinforcement Learning

On the Foundation of Distributionally Robust Reinforcement Learning

Distributionally Robust Constrained Reinforcement Learning under Strong Duality

Model-Free Robust Reinforcement Learning with Sample Complexity Analysis

Distributionally Robust Reinforcement Learning with Interactive Data Collection: Fundamental Hardness and Near-Optimal Algorithm

Improved Sample Complexity Bounds for Distributionally Robust Reinforcement Learning

Distributionally Robust Offline Reinforcement Learning with Linear Function Approximation

Upper and Lower Bounds for Distributionally Robust Off-Dynamics Reinforcement Learning

One-Step Distributional Reinforcement Learning

Robust Route Planning with Distributional Reinforcement Learning in a Stochastic Road Network Environment

Sample-Efficient Robust Multi-Agent Reinforcement Learning in the Face of Environmental Uncertainty

A Finite Sample Complexity Bound for Distributionally Robust Q-learning

Incorporating Unlabeled Data into Distributionally Robust Learning

Distributional Reinforcement Learning for Multi-Dimensional Reward Functions

Continuous Control Reinforcement Learning: Distributed Distributional DrQ Algorithms

Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage

Distributional Reinforcement Learning for Efficient Exploration

On Practical Robust Reinforcement Learning: Adjacent Uncertainty Set and Double-Agent Algorithm

Improving Robustness via Risk Averse Distributional Reinforcement Learning

Train Trajectory Optimization with High-Risk State Space Boundaries: A Safe Reinforcement Learning Approach