A Hybrid PAC Reinforcement Learning Algorithm

Ashkan Zehfroosh, Herbert G. Tanner
DOI: https://doi.org/10.48550/arXiv.2009.02602
2021-01-28
Abstract: This paper offers a new hybrid probably approximately correct (PAC) reinforcement learning (RL) algorithm for Markov decision processes (MDPs) that intelligently maintains favorable features of its parents. The designed algorithm, referred to as the Dyna-Delayed Q-learning (DDQ) algorithm, combines model-free and model-based learning approaches while outperforming both in most cases. The paper includes a PAC analysis of the DDQ algorithm and a derivation of its sample complexity. Numerical results are provided to support the claim regarding the new algorithm's sample efficiency compared to its parents as well as the best known model-free and model-based algorithms in application.
Machine Learning
What problem does this paper attempt to address?
This paper sets out to develop a new hybrid PAC (probably approximately correct) reinforcement learning algorithm that combines the advantages of model-free and model-based learning. Specifically, it proposes the Dyna-Delayed Q-learning (DDQ) algorithm, which aims to overcome the sample-complexity limitations of existing PAC-MDP algorithms and to outperform existing model-free and model-based algorithms in most cases.

### Problem Background

1. **Model-free reinforcement learning (model-free RL)**
   - **Advantages**: It suits complex tasks and does not need to construct a model of the environment.
   - **Disadvantages**: It requires a large amount of data (high sample complexity) because it relies only on observed rewards to learn the optimal policy.
2. **Model-based reinforcement learning (model-based RL)**
   - **Advantages**: It uses state-transition information to construct a model, which reduces sample complexity.
   - **Disadvantages**: It has relatively high computational complexity, and the learned model may be biased.

### Goals of the Paper

The paper aims to design a new PAC algorithm with the following properties:

- **Combining the advantages of model-free and model-based learning**: The algorithm behaves more like a model-free algorithm on large-scale problems and more like a model-based algorithm on small-scale problems that require high precision.
- **Ensuring PAC properties**: With high probability, the algorithm reaches an approximately optimal policy after a bounded number of time steps.
- **Optimizing sample complexity**: In the worst case, the sample complexity of DDQ is no worse than that of either R-max or Delayed Q-learning, and it is usually better than both.
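For reference, the PAC property mentioned in the goals is commonly formalized in the PAC-MDP literature along the following lines. The notation is assumed here for illustration and is not taken from this summary: \(\mathcal{A}_t\) is the policy the algorithm follows at step \(t\), \(V^*\) is the optimal value function, \(\epsilon\) is the accuracy, \(\delta\) is the failure probability, and \(\gamma\) is the discount factor.

\[
\Pr\Big[\,\big|\{\, t : V^{\mathcal{A}_t}(s_t) < V^*(s_t) - \epsilon \,\}\big| \le \zeta(\epsilon, \delta)\Big] \ge 1 - \delta,
\qquad
\zeta(\epsilon, \delta) = \mathrm{poly}\!\left(|S|,\, |A|,\, \tfrac{1}{\epsilon},\, \tfrac{1}{\delta},\, \tfrac{1}{1-\gamma}\right)
\]

Under this reading, the sample complexity \(\zeta\) counts the time steps on which the algorithm's current policy is not \(\epsilon\)-optimal, and an algorithm is PAC when that count is polynomially bounded with probability at least \(1 - \delta\).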
### Features of the DDQ Algorithm

The DDQ algorithm integrates characteristics of the Delayed Q-learning and R-max algorithms through two kinds of update:

- **Type-1 update**: Updates the Q-value from the most recent \(m_1\) experiences, and accepts an update only if it lowers the current estimate by at least \(\epsilon_1\).
- **Type-2 update**: Once a state-action pair has been visited at least \(m_2\) times, updates it using the value-iteration algorithm.
- **Dynamic adjustment**: Selects the suitable update method according to the scale and requirements of the problem, so that efficiency and accuracy are maintained across different scenarios.

### Experimental Results

The paper supports the effectiveness of the DDQ algorithm with numerical experiments. In terms of sample efficiency in particular, DDQ outperforms both Delayed Q-learning and R-max.

### Application Background

An important application of the algorithm is early childhood motor rehabilitation. In this field, robots can serve as intelligent toys that interact socially with infants with special needs and engage them in game-based activities to achieve the best rehabilitation outcomes. An MDP model can capture the dynamics of the social interaction between infant and robot, and the DDQ algorithm can guide the robot's behavior to maximize the rehabilitation effect.

In conclusion, the paper addresses the shortcomings of existing PAC-MDP algorithms in sample complexity and performance through the DDQ algorithm, and offers a more efficient and accurate solution for practical applications.
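As a rough illustration of the two update types listed under Features of the DDQ Algorithm, the sketch below shows one way they might look for a small tabular MDP. It is not the authors' implementation. The function names, the array layout (`counts`, `reward_sum`), the `sweeps` parameter, and the exact acceptance test (borrowed from the usual Delayed Q-learning rule, where the new estimate is the sample mean plus \(\epsilon_1\)) are all assumptions made for this example.

```python
import numpy as np


def type1_update(q, s, a, targets, m1, eps1):
    """Attempt a delayed, model-free update of q[s, a] (hypothetical helper).

    targets holds sampled values r + gamma * max_a' q[s', a'] for (s, a).
    Returns True if the update was accepted.
    """
    if len(targets) < m1:
        return False  # not enough experiences collected yet
    # Assumed rule: candidate is the mean of the last m1 targets plus eps1.
    candidate = float(np.mean(targets[-m1:])) + eps1
    # Accept only if the estimate drops by at least eps1.
    if q[s, a] - candidate >= eps1:
        q[s, a] = candidate
        return True
    return False


def type2_update(q, counts, reward_sum, m2, gamma, sweeps=200):
    """Value iteration on the empirical model (hypothetical helper).

    counts[s, a, s2] are transition counts and reward_sum[s, a] are summed
    rewards. Only pairs visited at least m2 times are backed up; all other
    pairs keep their current value.
    """
    visits = counts.sum(axis=2)
    known = visits >= m2
    safe = np.maximum(visits, 1)          # avoid division by zero
    p_hat = counts / safe[:, :, None]     # empirical transition model
    r_hat = reward_sum / safe             # empirical mean reward
    for _ in range(sweeps):
        v = q.max(axis=1)                 # greedy state values
        backup = r_hat + gamma * (p_hat @ v)
        q[known] = backup[known]
    return q
```

In this sketch the Type-1 path touches a single table entry and needs no model, while the Type-2 path sweeps the whole table using the estimated transitions, which mirrors the trade-off between sample cost and computation described in the problem background.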