Abstract: We introduce Unimotion, the first unified multi-task human motion model capable of both flexible motion control and frame-level motion understanding. While existing works control avatar motion with global text conditioning, or with fine-grained per-frame scripts, none can do both at once. In addition, none of the existing works can output frame-level text paired with the generated poses. In contrast, Unimotion allows motion to be controlled with global text, local frame-level text, or both at once, giving users more flexible control. Importantly, Unimotion is the first model that by design outputs local text paired with the generated poses, allowing users to know what motion happens and when, which is necessary for a wide range of applications. We show that Unimotion opens up new applications: 1) hierarchical control, allowing users to specify motion at different levels of detail; 2) obtaining motion text descriptions for existing MoCap data or YouTube videos; and 3) editability, generating motion from text and then editing the motion via text edits. Moreover, Unimotion attains state-of-the-art results for the frame-level text-to-motion task on the established HumanML3D dataset. The pre-trained model and code are available on our project page at https://coral79.github.io/uni-motion/.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper introduces a unified multi-task model named **Unimotion**, which addresses both human motion synthesis and human motion understanding. Specifically:

1. **Flexible Motion Control and Understanding**: Existing methods allow motion control either through global text (e.g., text describing the entire sequence) or through fine-grained text at the local frame level, but cannot do both simultaneously. Unimotion can control motion through global text, local frame-level text, or a combination of both, and can also output frame-level text descriptions corresponding to each pose.
2. **Hierarchical Control Capability**: Users can specify motions at different levels of abstraction for more flexible control. For example, one can specify the general motion of the arms through global text and the specific motion sequences of other body parts through local text.
3. **Motion Editing Functionality**: Generated motions can be edited through their text descriptions. Users first generate an initial motion together with its frame-level text descriptions, then regenerate the desired motion by editing these text segments.
4. **New Application Scenarios**:
   - **2D Video Annotation**: Add frame-level text annotations to human motions extracted from YouTube videos, which can serve as subtitles for visually impaired individuals.
   - **4D Motion Capture Data Annotation**: Add frame-level text descriptions to motion capture data obtained from inertial measurement units (IMUs) for easier retrieval and analysis.
   - **Hierarchical Control**: Allow users to specify motion details at different levels of abstraction.
   - **Motion Editing**: Adjust motions during animation production.

With these capabilities, Unimotion not only unifies the tasks of motion synthesis and understanding but also introduces new tasks, such as unconditionally generating human motions with frame-level text descriptions and generating frame-level text descriptions from motions. These features make Unimotion applicable to a range of practical settings, including video annotation, motion capture retrieval, and animation editing.
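To make "knowing what motion happens and when" concrete, here is a minimal sketch of how per-frame text labels could be collapsed into timed segments. It is an illustration only: the `Segment` type, the `frames_to_segments` helper, the plain-string label format, and the 20 fps default are all assumptions made for this example, not Unimotion's actual output format or API.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Segment:
    """A run of consecutive frames sharing one text label (hypothetical type)."""
    text: str
    start_s: float
    end_s: float


def frames_to_segments(frame_texts: List[str], fps: float = 20.0) -> List[Segment]:
    """Collapse per-frame labels into timed segments.

    Assumes one plain-string label per frame and a fixed frame rate;
    both are illustrative assumptions, not the paper's data format.
    """
    segments: List[Segment] = []
    frame_idx = 0
    # groupby merges only adjacent equal labels, so a label that recurs
    # later in the sequence starts a new segment.
    for text, run in groupby(frame_texts):
        length = sum(1 for _ in run)
        segments.append(Segment(text, frame_idx / fps, (frame_idx + length) / fps))
        frame_idx += length
    return segments


if __name__ == "__main__":
    labels = ["walk"] * 40 + ["sit down"] * 20
    for seg in frames_to_segments(labels):
        print(f"{seg.start_s:.1f}s-{seg.end_s:.1f}s: {seg.text}")
```

Under the same assumptions, editing a motion via text would amount to changing the `text` of one segment and expanding the segments back into per-frame labels before regenerating the motion.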

Unimotion: Unifying 3D Human Motion Synthesis and Understanding

FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models

Everything2Motion: Synchronizing Diverse Inputs Via a Unified Framework for Human Motion Synthesis

UniMuMo: Unified Text, Music and Motion Generation

DiverseMotion: Towards Diverse Human Motion Generation Via Discrete Diffusion

CoMA: Compositional Human Motion Generation with Multi-modal Agents

HumanTOMATO: Text-aligned Whole-body Motion Generation

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

TextIM: Part-aware Interactive Motion Synthesis from Text

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

Contact-aware Human Motion Generation from Textual Descriptions

Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer

MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls

AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism

Large Motion Model for Unified Multi-Modal Motion Generation

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance

Plan, Posture and Go: Towards Open-vocabulary Text-to-Motion Generation

MMM: Generative Masked Motion Model

HUMOS: Human Motion Model Conditioned on Body Shape