Harmon: Whole-Body Motion Generation of Humanoid Robots from Language Descriptions

Zhenyu Jiang, Yuqi Xie, Jinhan Li, Ye Yuan, Yifeng Zhu, Yuke Zhu

2024-10-17

Abstract: Humanoid robots, with their human-like embodiment, have the potential to integrate seamlessly into human environments. Critical to their coexistence and cooperation with humans is the ability to understand natural language communications and exhibit human-like behaviors. This work focuses on generating diverse whole-body motions for humanoid robots from language descriptions. We leverage human motion priors from extensive human motion datasets to initialize humanoid motions and employ the commonsense reasoning capabilities of Vision Language Models (VLMs) to edit and refine these motions. Our approach demonstrates the capability to produce natural, expressive, and text-aligned humanoid motions, validated through both simulated and real-world experiments. More videos can be found at https://ut-austin-rpl.github.io/Harmon/.

Robotics, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses how to generate whole-body motions for humanoid robots from free-form language descriptions. The authors aim to produce diverse humanoid behaviors that align with natural language instructions. This requires understanding an instruction and translating it into physical motion, so that the robot behaves in a human-like way in human environments and can collaborate with people effectively and safely. The proposed method, HARMON, combines human motion priors with the commonsense reasoning capabilities of Vision Language Models (VLMs). HARMON first generates a human motion from the language description and then retargets it to the humanoid robot. To improve alignment between the robot motion and the description, HARMON also uses a VLM to generate head and finger movements and to iteratively adjust the body motion. Experiments show that HARMON generates natural, expressive humanoid motions that are consistent with the text descriptions, and that these motions can be executed on humanoid robots both in simulation and in the real world.
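The stages described above (text to human motion, retargeting, VLM-generated head and finger motion, iterative refinement) can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`Motion`, `Feedback`, `generate_robot_motion`, the stub functions, the joint names, and the `max_rounds` limit) is a hypothetical placeholder, and the stubs stand in for the real motion generator, retargeting step, and VLM, none of which are specified in this summary.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical representation: one dict of joint name -> angle per frame.
Motion = List[Dict[str, float]]


@dataclass
class Feedback:
    """Hypothetical critic output: whether the motion matches the text,
    plus per-joint angle offsets to apply if it does not."""
    aligned: bool
    adjustments: Dict[str, float] = field(default_factory=dict)


def generate_robot_motion(
    text: str,
    generate_human_motion: Callable[[str], Motion],
    retarget: Callable[[Motion], Motion],
    add_head_and_fingers: Callable[[str, Motion], Motion],
    critique: Callable[[str, Motion], Feedback],
    max_rounds: int = 3,  # assumed limit, not taken from the paper
) -> Motion:
    human = generate_human_motion(text)       # 1. text -> human motion
    robot = retarget(human)                   # 2. human -> humanoid joints
    robot = add_head_and_fingers(text, robot)  # 3. add head/finger motion
    for _ in range(max_rounds):               # 4. iterative refinement
        feedback = critique(text, robot)
        if feedback.aligned:
            break
        robot = [
            {**frame, **{joint: frame.get(joint, 0.0) + delta
                         for joint, delta in feedback.adjustments.items()}}
            for frame in robot
        ]
    return robot


# Stub components so the sketch runs end to end (all illustrative).
def stub_human_motion(text: str) -> Motion:
    return [{"shoulder": 0.0}, {"shoulder": 0.2}]


def stub_retarget(motion: Motion) -> Motion:
    return [dict(frame) for frame in motion]  # identity mapping


def stub_head_and_fingers(text: str, motion: Motion) -> Motion:
    return [{**frame, "head_yaw": 0.0, "finger_curl": 0.0} for frame in motion]


def stub_critique(text: str, motion: Motion) -> Feedback:
    # Pretend the critic wants the arm raised higher in the final frame.
    if motion[-1]["shoulder"] < 0.5:
        return Feedback(aligned=False, adjustments={"shoulder": 0.2})
    return Feedback(aligned=True)


demo_motion = generate_robot_motion(
    "wave with the right hand",
    stub_human_motion,
    stub_retarget,
    stub_head_and_fingers,
    stub_critique,
)
```

Passing each stage in as a callable keeps the control flow separate from any particular model, so a real text-to-motion generator or VLM critic could be substituted for the stubs without changing the loop.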

Text-driven Visual Prosody Generation for Embodied Conversational Agents

HumRoboSim: an Autonomous Humanoid Robot Simulation System

Human-Robot Sign Language Motion Retargeting from Videos

Dynamic Movement Primitive Based Motion Retargeting for Dual-Arm Sign Language Motions

EMOTION: Expressive Motion Sequence Generation for Humanoid Robots with In-Context Learning

HARMONIOUS -- Human-like reactive motion control and multimodal perception for humanoid robots

HYPERmotion: Learning Hybrid Behavior Planning for Autonomous Loco-manipulation

Generating Holistic 3D Human Motion from Speech

A language‐directed virtual human motion generation approach based on musculoskeletal models

Words into Action: Learning Diverse Humanoid Robot Behaviors using Language Guided Iterative Motion Refinement

HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes

SpeechAct: Towards Generating Whole-body Motion from Speech

Towards Enhanced Human Activity Recognition through Natural Language Generation and Pose Estimation

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

Beyond Talking -- Generating Holistic 3D Human Dyadic Motion for Communication

Expressive Whole-Body Control for Humanoid Robots

Robot Interaction Behavior Generation based on Social Motion Forecasting for Human-Robot Interaction

Generative Expressive Robot Behaviors using Large Language Models

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

LaserHuman: Language-guided Scene-aware Human Motion Generation in Free Environment