World Model-based Perception for Visual Legged Locomotion

Hang Lai,Jiahang Cao,Jiafeng Xu,Hongtao Wu,Yunfeng Lin,Tao Kong,Yong Yu,Weinan Zhang
2024-09-25
Abstract:Legged locomotion over various terrains is challenging and requires precise perception of the robot and its surroundings from both proprioception and vision. However, learning directly from high-dimensional visual input is often data-inefficient and intricate. To address this issue, traditional methods attempt to learn a teacher policy with access to privileged information first and then learn a student policy to imitate the teacher's behavior with visual input. Despite some progress, this imitation framework prevents the student policy from achieving optimal performance due to the information gap between inputs. Furthermore, the learning process is unnatural since animals intuitively learn to traverse different terrains based on their understanding of the world without privileged knowledge. Inspired by this natural ability, we propose a simple yet effective method, World Model-based Perception (WMP), which builds a world model of the environment and learns a policy based on the world model. We illustrate that though completely trained in simulation, the world model can make accurate predictions of real-world trajectories, thus providing informative signals for the policy controller. Extensive simulated and real-world experiments demonstrate that WMP outperforms state-of-the-art baselines in traversability and robustness. Videos and Code are available at: <a class="link-external link-https" href="https://wmp-loco.github.io/" rel="external noopener nofollow">this https URL</a>.
Robotics,Machine Learning
What problem does this paper attempt to address?
The paper aims to address challenging issues in visual legged locomotion, particularly how to utilize visual information for precise perception and efficient learning. Specifically, the paper proposes a new framework called World Model-based Perception (WMP) to overcome some limitations of existing methods. Traditional methods typically adopt a privileged learning framework, where a teacher policy that has access to low-dimensional privileged information (such as scan points around the robot) is first trained, and then a student policy based on visual input is trained to mimic the teacher's behavior. However, this approach has the following problems: 1. **Performance Gap**: Due to generalization errors, the student policy finds it difficult to fully replicate the performance of the teacher policy. 2. **Design Complexity**: The teacher policy requires access to various types of additional information, which is challenging to achieve in practical applications. 3. **Data Inefficiency**: Learning policies directly from high-dimensional pixel inputs is very data-intensive, and for forward-facing cameras, the policy needs to remember past perception information to predict the terrain under the robot's feet. WMP extracts useful information by constructing a world model of the environment and learns policies based on this model. This approach not only avoids the limitations of privileged learning but also naturally compresses a series of high-dimensional perceptions into meaningful representations to aid decision-making. Experimental results show that WMP outperforms existing baseline methods on various terrains and can successfully handle complex terrains in real-world environments.