Offline Reinforcement Learning at Multiple Frequencies

Kaylee Burns, Tianhe Yu, Chelsea Finn, Karol Hausman
DOI: https://doi.org/10.48550/arXiv.2207.13082
2022-07-27
Abstract: Leveraging many sources of offline robot data requires grappling with the heterogeneity of such data. In this paper, we focus on one particular aspect of heterogeneity: learning from offline data collected at different control frequencies. Across labs, the discretization of controllers, sampling rates of sensors, and demands of a task of interest may differ, giving rise to a mixture of frequencies in an aggregated dataset. We study how well offline reinforcement learning (RL) algorithms can accommodate data with a mixture of frequencies during training. We observe that the $Q$-value propagates at different rates for different discretizations, leading to a number of learning challenges for off-the-shelf offline RL. We present a simple yet effective solution that enforces consistency in the rate of $Q$-value updates to stabilize learning. By scaling the value of $N$ in $N$-step returns with the discretization size, we effectively balance $Q$-value propagation, leading to more stable convergence. On three simulated robotic control problems, we empirically find that this simple approach outperforms naïve mixing by 50% on average.
Machine Learning, Artificial Intelligence, Robotics
### What problem does this paper attempt to solve?

This paper addresses **how to make effective use of data collected at different control frequencies in offline reinforcement learning (offline RL)**. Specifically, it focuses on handling heterogeneous data from multiple sources in robot learning when those sources were recorded at different control frequencies.

#### Background and Challenges

1. **Data Heterogeneity**: Controller discretization, sensor sampling rates, and task requirements differ across laboratories and scenarios, so an aggregated dataset contains a mixture of frequencies.
2. **Inconsistent Q-value Propagation Rates**: Because the discretization step sizes differ, Q-values propagate at different rates for different frequencies. This poses a challenge for off-the-shelf offline RL algorithms and can lead to unstable training and degraded performance.
3. **Insufficiency of Existing Methods**: Previous work has studied how to stabilize RL algorithms at a single high frequency, but learning effectively from multi-frequency data has not been studied in depth.

#### Research Objectives

The main objective of the paper is to analyze, and provide a solution to, the problems that arise when multi-frequency data is used in offline reinforcement learning. Specific objectives include:

- **Problem Analysis**: Verify that mixing data with different discretizations leads to inconsistent Q-value update rates, which in turn hurts performance.
- **Propose a Solution**: Introduce Adaptive N-Step Returns, which adjust the Q-value update rates under different discretizations to make them more consistent, thereby improving training stability and performance.

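
To make the propagation-rate mismatch described above concrete, here is a minimal sketch under an idealized assumption that is not taken from the paper: with one-step backups along a single trajectory, each sweep of updates moves reward information back by one transition, which corresponds to δt seconds of task time. The function name `sweeps_to_propagate` and the example step sizes are hypothetical.

```python
import math


def sweeps_to_propagate(horizon_seconds: float, dt: float) -> int:
    """Idealized number of one-step backup sweeps needed for reward
    information to travel `horizon_seconds` back along a trajectory
    sampled every `dt` seconds (one transition per sweep)."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    # round() guards against floating-point noise before taking the ceiling.
    return math.ceil(round(horizon_seconds / dt, 9))


# Hypothetical control step sizes, all covering one second of task time.
for dt in (0.01, 0.02, 0.05):
    print(f"dt={dt}: {sweeps_to_propagate(1.0, dt)} sweeps")
```

Under this idealization, data recorded at the finest step needs five times as many sweeps as data recorded at the coarsest step to cover the same span of task time, which is one way to read the inconsistent update rates listed above.
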
#### Method Overview

The authors propose a simple but effective method, **Adaptive N-Step Returns**, which keeps the Q-value update rates consistent across discretizations by scaling the value of N in the N-step return with the time discretization $\delta t$, so that each target backs up $N/\delta t$ transitions. The target is:

\[ Q_{\delta t}^{\text{target}}=\sum_{t' = 0}^{N/\delta t - 1}\gamma^{t'}r(s_{t + t'},a_{t + t'})+\gamma^{N/\delta t}Q_{\delta t}(s_{t + N/\delta t},a_{t + N/\delta t}) \]

Because data at every frequency is then backed up over a matching span, the Q-value update rates become consistent, which improves training stability and final performance.

#### Experimental Results

The experimental results show that Adaptive N-Step Returns significantly outperforms naive data mixing in multiple simulated environments, especially in sparse-reward environments. For example, in the Pendulum task the method increased the average return by more than 50%, and in the Kitchen task it nearly doubled the average return.

In conclusion, the paper analyzes the challenges that multi-frequency data poses for offline reinforcement learning and offers a simple remedy for them in robot learning.
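
As an illustration of the target in the Method Overview, the following is a minimal NumPy sketch, not the authors' implementation. It follows the formula exactly as written in this summary, with the per-step discount γ left unscaled. The function name `adaptive_n_step_target`, the array layout, and the truncation at the end of a trajectory are all assumptions made for the example.

```python
import numpy as np


def adaptive_n_step_target(rewards, q_values, t, n_seconds, dt, gamma=0.99):
    """Adaptive N-step target for step `t` of one trajectory sampled every `dt`.

    rewards:   shape (T,), reward r(s_k, a_k) for each step k.
    q_values:  shape (T + 1,), current estimate Q(s_k, a_k); the last entry
               is the value after the final step (0.0 if terminal).
    n_seconds: the horizon N, so the backup spans N / dt transitions.
    """
    rewards = np.asarray(rewards, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    n = max(1, int(round(n_seconds / dt)))  # number of transitions, N / dt
    n = min(n, len(rewards) - t)  # assumption: truncate at trajectory end
    discounts = gamma ** np.arange(n)  # gamma^0 ... gamma^(n-1)
    n_step_return = float(np.dot(discounts, rewards[t:t + n]))
    # Bootstrap from the Q estimate n transitions ahead.
    return n_step_return + gamma ** n * float(q_values[t + n])


# Hypothetical data: the same horizon N gives 2 steps at dt=0.1, 4 at dt=0.05.
rewards = np.ones(10)
q_values = np.full(11, 8.0)
coarse = adaptive_n_step_target(rewards, q_values, 0, n_seconds=0.2, dt=0.1, gamma=0.5)
fine = adaptive_n_step_target(rewards, q_values, 0, n_seconds=0.2, dt=0.05, gamma=0.5)
print(coarse, fine)
```

In a mixed-frequency replay buffer, each transition would need to carry its own `dt` so that the step count can be chosen per sample. How that is stored is a design choice this sketch leaves open.
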