AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents

Jieming Cui, Tengyu Liu, Nian Liu, Yaodong Yang, Yixin Zhu, Siyuan Huang
2024-03-19
Abstract: Traditional approaches in physics-based motion generation, centered around imitation learning and reward shaping, often struggle to adapt to new scenarios. To tackle this limitation, we propose AnySkill, a novel hierarchical method that learns physically plausible interactions following open-vocabulary instructions. Our approach begins by developing a set of atomic actions via a low-level controller trained via imitation learning. Upon receiving an open-vocabulary textual instruction, AnySkill employs a high-level policy that selects and integrates these atomic actions to maximize the CLIP similarity between the agent's rendered images and the text. An important feature of our method is the use of image-based rewards for the high-level policy, which allows the agent to learn interactions with objects without manual reward engineering. We demonstrate AnySkill's capability to generate realistic and natural motion sequences in response to unseen instructions of varying lengths, marking it the first method capable of open-vocabulary physical skill learning for interactive humanoid agents.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper addresses physical skill learning for interactive virtual agents: how to make a physics-based humanoid agent produce natural, physically plausible motion sequences from open-vocabulary instructions, that is, textual descriptions it has not seen before. It proposes AnySkill, a hierarchical method that pairs a low-level controller with a high-level policy. The low-level controller acquires a set of atomic actions through Generative Adversarial Imitation Learning (GAIL). Given a text instruction, the high-level policy selects and combines these atomic actions and is optimized with an image-based reward: the CLIP similarity between rendered images of the agent and the text. Because the reward is computed from images, the agent can learn interactive tasks in new scenarios without manually designed reward functions. The paper reports that AnySkill outperforms existing methods in both qualitative and quantitative evaluations across a range of open-vocabulary physical skills. It also shows the agent interacting with dynamic objects (such as a soccer ball and a door), which supports its potential for use in more complex environments.
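
The image-based reward described above can be illustrated with a small sketch. It assumes the rendered image and the instruction have already been encoded into vectors of the same dimension by a CLIP-style model, and it does not call any real CLIP library. The function names (`image_text_reward`, `pick_best_action`) and the toy embeddings are hypothetical. The greedy selection step is a simplified stand-in for the learned high-level policy, not the paper's training procedure.

```python
import numpy as np


def image_text_reward(image_embedding, text_embedding):
    """Cosine similarity between an image embedding and a text embedding.

    Both inputs are assumed to be precomputed by a CLIP-style encoder
    (an assumption; no real encoder is called here).
    """
    img = np.asarray(image_embedding, dtype=float)
    txt = np.asarray(text_embedding, dtype=float)
    return float(img @ txt / (np.linalg.norm(img) * np.linalg.norm(txt)))


def pick_best_action(candidate_image_embeddings, text_embedding):
    """Greedy stand-in for the high-level policy (hypothetical helper).

    Each candidate is the embedding of the frame that would be rendered
    after executing one atomic action. Returns the index of the candidate
    with the highest reward, plus all rewards.
    """
    rewards = [image_text_reward(e, text_embedding)
               for e in candidate_image_embeddings]
    return int(np.argmax(rewards)), rewards


# Toy 3-dimensional embeddings, for illustration only.
text = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the text embedding
    [1.0, 1.0, 0.0],  # partially aligned
    [2.0, 0.0, 0.0],  # same direction as the text embedding
]
best_index, rewards = pick_best_action(candidates, text)
print(best_index, [round(r, 3) for r in rewards])
```

In this toy setup the third candidate is chosen because its embedding points in the same direction as the text embedding. In the method as summarized, the high-level policy would be optimized against this kind of similarity signal rather than choosing greedily per step.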