Abstract: Reinforcement Learning (RL) has recently achieved remarkable success in robotic control. However, most works in RL operate in simulated environments where privileged knowledge (e.g., dynamics, surroundings, terrains) is readily available. Conversely, in real-world scenarios, robot agents usually rely solely on local states (e.g., proprioceptive feedback of robot joints) to select actions, leading to a significant sim-to-real gap. Existing methods address this gap by either gradually reducing the reliance on privileged knowledge or performing a two-stage policy imitation. However, we argue that these methods are limited in their ability to fully leverage the available privileged knowledge, resulting in suboptimal performance. In this paper, we formulate the sim-to-real gap as an information bottleneck problem and therefore propose a novel privileged knowledge distillation method called the Historical Information Bottleneck (HIB). In particular, HIB learns a privileged knowledge representation from historical trajectories by capturing the underlying changeable dynamic information. Theoretical analysis shows that the learned privileged knowledge representation helps reduce the value discrepancy between the oracle and learned policies. Empirical experiments on both simulated and real-world tasks demonstrate that HIB yields improved generalizability compared to previous methods. Videos of real-world experiments are available at https://sites.google.com/view/history-ib.

### What problem does this paper attempt to solve?

This paper addresses the sim-to-real gap that arises when a reinforcement learning (RL) policy trained in simulation is deployed in the real world. Specifically:

1. **Differences between simulated and real environments**: Most RL research is carried out in simulation, where privileged knowledge (such as dynamics, surroundings, and terrain information) is easy to obtain. In the real world, robots can usually rely only on local states (such as proprioceptive feedback from joints) to select actions, which leads to a significant sim-to-real gap.
2. **Limitations of existing methods**:
   - Existing methods narrow this gap either by gradually reducing the dependence on privileged knowledge or by performing two-stage policy imitation.
   - These methods fail to fully exploit the available privileged knowledge, resulting in suboptimal performance.
3. **The proposed new method**: The authors treat the sim-to-real gap as an information bottleneck problem and propose the Historical Information Bottleneck (HIB). HIB learns a representation of privileged knowledge from historical trajectories by capturing the changing dynamics information they contain. The authors show theoretically that this representation helps reduce the value difference between the oracle policy and the learned policy.

### Main contributions

1. **Proposing the HIB method**: HIB extracts a representation of privileged knowledge from a fixed-length history, making better use of the privileged knowledge available in simulation.
2. **Theoretical analysis**: The paper analyzes both traditional policy imitation algorithms and the new method, showing the importance of minimizing the privileged modeling error for learning an approximately optimal policy.
3. **Experimental validation**: Experiments on simulated and real-world tasks show that HIB generalizes better than existing state-of-the-art methods, especially in out-of-distribution test environments.

### Summary of mathematical formulas

- **Value difference theorem**:
  \[ \sup_{s_l, s_p, a} \left\| Q^*(s_l, s_p, a) - \hat{Q}_{\hat{\pi}}(s_l, a) \right\| \leq \frac{2\gamma r_{\max}}{(1-\gamma)^2} \epsilon_{\hat{\pi}} \]
  where
  \[ \epsilon_{\hat{\pi}} = \sup_{s_l, s_p} D_{TV}\left(\pi^*(\cdot | s_l, s_p) \,\|\, \hat{\pi}(\cdot | s_l)\right) \]
  Here \(s_l\) is the local state, \(s_p\) is the privileged state, \(\pi^*\) is the oracle policy, and \(\hat{\pi}\) is the learned policy, which sees only the local state.
- **Privileged modeling difference theorem**:
  \[ \sup_{t \geq t_0} \sup_{s_l, s_p, a} \left| Q^*(s_{l,t}, s_{p,t}, a_t) - \hat{Q}_t(s_{l,t}, \hat{s}_{p,t}, a_t) \right| \leq \frac{\Delta E}{1-\gamma} + \frac{2\gamma r_{\max}}{(1-\gamma)^2} \epsilon_{\hat{P}} \]
  where
  \[ \epsilon_{\hat{P}} = \sup_{t \geq t_0} \sup_{h_{t+1}} D_{TV}\left(P(\cdot | h_{t+1}) \,\|\, \hat{P}(\cdot | h_{t+1})\right) \]

These bounds, together with the experiments, are the basis on which the authors argue for the effectiveness of HIB.
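To make the two bounds above concrete, the following is a minimal numerical sketch, not the authors' code. It assumes discrete action distributions, and the example policies, `gamma`, `r_max`, and all function names are hypothetical values chosen only for illustration. It evaluates the total variation distance and the right-hand sides of both inequalities.

```python
import numpy as np


def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()


def imitation_bound(eps_pi, gamma, r_max):
    """Right-hand side of the value difference bound."""
    return 2.0 * gamma * r_max / (1.0 - gamma) ** 2 * eps_pi


def privileged_modeling_bound(delta_e, eps_p, gamma, r_max):
    """Right-hand side of the privileged modeling difference bound."""
    return delta_e / (1.0 - gamma) + 2.0 * gamma * r_max / (1.0 - gamma) ** 2 * eps_p


# Hypothetical policies over 3 actions. Each row of `oracle` is
# pi*(. | s_l, s_p) for one (s_l, s_p) pair; the matching row of `student`
# is pi_hat(. | s_l) for the same local state.
oracle = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
student = np.array([[0.6, 0.3, 0.1],
                    [0.2, 0.7, 0.1]])

# The supremum over states becomes a max over the enumerated rows.
eps = max(total_variation(o, s) for o, s in zip(oracle, student))
bound = imitation_bound(eps, gamma=0.9, r_max=1.0)
print(f"eps = {eps:.3f}, value-difference bound = {bound:.2f}")
```

With these illustrative numbers, `eps` is about 0.1 and the bound is about 18. The \(1/(1-\gamma)^2\) factor dominates when \(\gamma\) is close to 1, so even a small policy mismatch yields a loose bound. This is consistent with the summary's emphasis on driving the modeling error down.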

Bridging the Sim-to-Real Gap from the Information Bottleneck Perspective

Privileged Knowledge Distillation for Sim-to-Real Policy Generalization
