Haoran He, Peilin Wu, Chenjia Bai, Hang Lai, Lingxiao Wang, Ling Pan, Xiaolin Hu, Weinan Zhang
Abstract: Reinforcement Learning (RL) has recently achieved remarkable success in robotic control. However, most works in RL operate in simulated environments where privileged knowledge (e.g., dynamics, surroundings, terrains) is readily available. Conversely, in real-world scenarios, robot agents usually rely solely on local states (e.g., proprioceptive feedback of robot joints) to select actions, leading to a significant sim-to-real gap. Existing methods address this gap by either gradually reducing the reliance on privileged knowledge or performing a two-stage policy imitation. However, we argue that these methods are limited in their ability to fully leverage the available privileged knowledge, resulting in suboptimal performance. In this paper, we formulate the sim-to-real gap as an information bottleneck problem and therefore propose a novel privileged knowledge distillation method called the Historical Information Bottleneck (HIB). In particular, HIB learns a privileged knowledge representation from historical trajectories by capturing the underlying changeable dynamic information. Theoretical analysis shows that the learned privileged knowledge representation helps reduce the value discrepancy between the oracle and learned policies. Empirical experiments on both simulated and real-world tasks demonstrate that HIB yields improved generalizability compared to previous methods. Videos of real-world experiments are available at https://sites.google.com/view/history-ib.
### What problem does this paper attempt to solve?
This paper addresses the sim-to-real gap that arises when reinforcement learning (RL) policies trained in simulation are transferred to real environments. Specifically:
1. **Differences between simulated and real environments**: Most RL research is carried out in simulation, where privileged knowledge (such as dynamics, surroundings, and terrain information) is readily available. In the real world, however, robots can usually rely only on local states (such as proprioceptive feedback from their joints) to select actions, which leads to a significant sim-to-real gap.
2. **Limitations of existing methods**:
- Existing methods narrow this gap either by gradually reducing the reliance on privileged knowledge or by performing a two-stage policy imitation.
- The authors argue that these methods do not fully exploit the available privileged knowledge, resulting in suboptimal performance.
3. **The proposed method**: The authors formulate the sim-to-real gap as an information bottleneck problem and propose the Historical Information Bottleneck (HIB), a privileged knowledge distillation method. HIB learns a privileged knowledge representation from historical trajectories by capturing the underlying changeable dynamics information. The theoretical analysis shows that this representation helps reduce the value discrepancy between the oracle policy and the learned policy.
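To make the information bottleneck framing concrete, the sketch below shows the generic variational form of such an objective: a term that rewards predicting the privileged state from a latent code, plus a weighted KL term that compresses the code. This is an illustration under stated assumptions, not the loss used by HIB, which the summary above does not specify. The squared-error prediction term, the Gaussian latent, and all names (`ib_style_loss`, `beta`, and so on) are hypothetical.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def squared_error(pred, target):
    """Sum of squared differences between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def ib_style_loss(pred_privileged, true_privileged, mu, log_var, beta):
    """Generic bottleneck-style loss: prediction error + beta * compression.

    pred_privileged: privileged state predicted from the latent code
                     (assumed to come from some history encoder/decoder)
    true_privileged: privileged state, available only in simulation
    mu, log_var:     parameters of the Gaussian latent for this history
    beta:            trade-off weight (hypothetical hyperparameter)
    """
    fit = squared_error(pred_privileged, true_privileged)
    compress = kl_to_standard_normal(mu, log_var)
    return fit + beta * compress


# Hypothetical values for a single history window.
loss = ib_style_loss(
    pred_privileged=[0.9, 0.1],
    true_privileged=[1.0, 0.0],
    mu=[0.5, -0.5],
    log_var=[0.0, 0.0],
    beta=0.01,
)
print(f"loss = {loss:.4f}")
```

A larger `beta` pushes the latent code toward the prior and discards more of the history, while a smaller `beta` favors reconstruction of the privileged state.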
### Main contributions
1. **Proposing the HIB method**: HIB extracts a privileged knowledge representation from a fixed-length history, making better use of the privileged knowledge available in simulation.
2. **Theoretical analysis**: The paper analyzes both traditional policy imitation algorithms and the new method, showing that minimizing the privileged modeling error is important for learning a near-optimal policy.
3. **Empirical validation**: Experiments on simulated and real-world tasks show that HIB generalizes better than existing state-of-the-art methods, particularly in out-of-distribution test environments.
### Summary of mathematical formulas
- **Value difference theorem**:
\[
\sup_{s_l, s_p, a} \left| Q^*(s_l, s_p, a) - \hat{Q}_{\hat{\pi}}(s_l, a) \right| \leq \frac{2\gamma r_{\max}}{(1-\gamma)^2} \epsilon_{\hat{\pi}}
\]
where,
\[
\epsilon_{\hat{\pi}} = \sup_{s_l, s_p} D_{TV}\left(\pi^*(\cdot | s_l, s_p) \| \hat{\pi}(\cdot | s_l)\right)
\]
- **Privileged modeling difference theorem**:
\[
\sup_{t \geq t_0} \sup_{s_l, s_p, a} \left| Q^*(s_{l,t}, s_{p,t}, a_t) - \hat{Q}_t(s_{l,t}, \hat{s}_{p,t}, a_t) \right| \leq \frac{\Delta E}{1-\gamma} + \frac{2\gamma r_{\max}}{(1-\gamma)^2} \epsilon_{\hat{P}}
\]
where,
\[
\epsilon_{\hat{P}} = \sup_{t \geq t_0} \sup_{h_{t+1}} D_{TV}\left(P(\cdot | h_{t+1}) \| \hat{P}(\cdot | h_{t+1})\right)
\]
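To get a feel for the scale of these bounds, the following sketch evaluates both right-hand sides for a few hypothetical numbers. It assumes discrete action distributions and the convention that total variation is half the L1 distance, neither of which is stated above. It also treats `delta_e` as a given non-negative number, because the summary does not define the ΔE term. Note that the theorems take a supremum over states, so a total variation value measured at a single state only illustrates the arithmetic.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    Assumes the convention D_TV = 0.5 * sum_i |p_i - q_i|.
    """
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def policy_imitation_bound(gamma, r_max, eps_pi):
    """Right-hand side of the value difference bound."""
    return 2.0 * gamma * r_max / (1.0 - gamma) ** 2 * eps_pi


def privileged_modeling_bound(delta_e, gamma, r_max, eps_p):
    """Right-hand side of the privileged modeling difference bound."""
    return delta_e / (1.0 - gamma) + 2.0 * gamma * r_max / (1.0 - gamma) ** 2 * eps_p


# Hypothetical numbers, chosen only for illustration.
oracle_action_probs = [0.7, 0.2, 0.1]   # pi*(. | s_l, s_p) at one state
student_action_probs = [0.5, 0.3, 0.2]  # pi_hat(. | s_l) at the same state
tv_at_state = total_variation(oracle_action_probs, student_action_probs)

for gamma in (0.9, 0.99):
    b1 = policy_imitation_bound(gamma, r_max=1.0, eps_pi=tv_at_state)
    b2 = privileged_modeling_bound(delta_e=0.1, gamma=gamma, r_max=1.0, eps_p=0.05)
    print(f"gamma={gamma}: policy bound={b1:.1f}, modeling bound={b2:.1f}")
```

Both bounds grow with the factor 1/(1-γ)², so for discount factors close to 1 even a small distribution mismatch allows a large worst-case value gap.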
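The two theorems show that the value gap to the oracle policy is controlled by the policy imitation error and the privileged modeling error respectively, which motivates HIB's focus on learning an accurate privileged knowledge representation.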