D-LORD for Motion Stylization

Meenakshi Gupta, Mingyuan Lei, Tat-Jen Cham, Hwee Kuan Lee
DOI: https://doi.org/10.1109/TSMC.2024.3502498
2024-12-05
Abstract: This paper introduces a novel framework named D-LORD (Double Latent Optimization for Representation Disentanglement), which is designed for motion stylization (motion style transfer and motion retargeting). The primary objective of this framework is to separate the class and content information from a given motion sequence using a data-driven latent optimization approach. Here, class refers to person-specific style, such as a particular emotion or an individual's identity, while content relates to the style-agnostic aspect of an action, such as walking or jumping, as universally understood concepts. The key advantage of D-LORD is its ability to perform style transfer without needing paired motion data. Instead, it utilizes class and content labels during the latent optimization process. By disentangling the representation, the framework enables the transformation of one motion sequence's style into another's using Adaptive Instance Normalization. The proposed D-LORD framework is designed with a focus on generalization, allowing it to handle different class and content labels for various applications. Additionally, it can generate diverse motion sequences when specific class and content labels are provided. The framework's efficacy is demonstrated through experimentation on three datasets: the CMU XIA dataset for motion style transfer, and the MHAD and RRIS Ability datasets for motion retargeting. Notably, this paper presents the first generalized framework for motion style transfer and motion retargeting, showcasing its potential contributions in this area.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper targets several key limitations of existing motion style transfer and motion retargeting algorithms:

1. **Difficulties in Adversarial Training**: Existing methods usually rely on adversarial training, which makes the model difficult to train and requires careful tuning.
2. **Root Motion Preservation Problem**: During style transfer, if the target style is associated with specific content features, preserving the root motion may reduce the quality of the style transfer.
3. **Spatial Relationships between Joints**: Generating content encodings with 1D convolution ignores the spatial relationships between joints, so motion content is poorly preserved across significantly different action types.
4. **Root Trajectory Calculation**: Directly calculating the root trajectory from the source motion weakens the stylization effect, especially when converting high-intensity styles to walking motions.
5. **Deterministic Output**: Given a pair of input content and style motions, existing methods produce a single deterministic output, which lacks diversity.

To address these problems, the paper proposes a new framework named D-LORD (Double Latent Optimization for Representation Disentanglement). D-LORD addresses them in the following ways:

- **Dual-Latent Optimization**: D-LORD utilizes latent optimization techniques to decompose motion data into three latent variables: class, content, and aleatoric uncertainty. This decomposition enables style transfer without paired motion data, using only class and content labels.
- **No Need for Paired Data**: D-LORD does not require paired motion data; instead, it uses labeled motion data to disentangle classes (such as emotions or subject IDs) from contents (such as actions), thereby improving the generalization ability of the model.
- **Generation of Diverse Motion Sequences**: By sampling the aleatoric uncertainty latent variable from a Gaussian distribution, D-LORD can generate diverse motion sequences, enhancing the diversity and richness of animations.
- **Accurate Motion Style Adaptation**: By disentangling class and content features, D-LORD can adapt style-specific motion features to the content, and it works effectively even when the target style is associated with specific motion features.

Overall, D-LORD provides a general motion stylization framework, simplifies the training process, and can perform effective style transfer between different types of motions while generating diverse motion sequences. It is also applicable to motion retargeting tasks, demonstrating its potential contribution in this field.

### Mathematical Formula Representation

The key formulas involved in the paper are as follows:

1. **Reconstruction Loss**:
\[ L_i^r = \| G_{\theta_G}(c_{x_i}, a_i, e_{y_i}) - M_i \|_2 \]
where \( G_{\theta_G} \) is the generator network with parameters \( \theta_G \), \( c_{x_i} \) is the class embedding, \( a_i \) is the aleatoric uncertainty embedding, \( e_{y_i} \) is the content embedding, and \( M_i \) is the actual motion sequence.

2. **Bone Consistency Loss**:
\[ L_i^s = \frac{1}{T - 1}\sum_{t = 1}^{T}(l_t - \bar{l}_i)^2 \]
where \( \bar{l}_i \) is the average bone length of the generated motion sequence \( \hat{M}_i \), and \( l_t \) is the bone length of \( \hat{M}_i \) at the \( t \)-th frame.

3. **Objective Function in the First Stage**:
\[ L_1 = \sum_{i = 1}^{n}\left( L_i^r + \lambda \| a_i \|_2 + L_i^s \right) \]
where \( \lambda \)
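
The three formulas above can be sketched in plain Python to make the arithmetic concrete. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the dictionary keys, and the value of `lam` are hypothetical, \( l_t \) is treated as the length of a single bone per frame, and the generator output is passed in as precomputed data rather than produced by a network. The per-sample terms are summed inside the loop, which is one reading of the first-stage objective.

```python
import math


def reconstruction_loss(generated, target):
    """L2 norm of the difference between a generated motion and the actual motion M_i.

    Both arguments are lists of frames; each frame is a flat list of floats.
    """
    squared = sum(
        (g - m) ** 2
        for g_frame, m_frame in zip(generated, target)
        for g, m in zip(g_frame, m_frame)
    )
    return math.sqrt(squared)


def bone_consistency_loss(bone_lengths):
    """Spread of one bone's length over T frames, using the 1/(T-1) factor.

    Assumption: l_t is a single scalar bone length per frame.
    """
    num_frames = len(bone_lengths)
    mean_length = sum(bone_lengths) / num_frames
    return sum((l - mean_length) ** 2 for l in bone_lengths) / (num_frames - 1)


def l2_norm(vector):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in vector))


def stage_one_objective(samples, lam):
    """Sum of L_i^r + lam * ||a_i||_2 + L_i^s over all samples.

    Each sample is a dict with hypothetical keys:
    'generated', 'target', 'aleatoric', 'bone_lengths'.
    """
    total = 0.0
    for s in samples:
        total += reconstruction_loss(s["generated"], s["target"])
        total += lam * l2_norm(s["aleatoric"])
        total += bone_consistency_loss(s["bone_lengths"])
    return total


# Toy data: two frames with two coordinates each, one bone tracked over three frames.
sample = {
    "generated": [[0.0, 0.0], [3.0, 4.0]],
    "target": [[0.0, 0.0], [0.0, 0.0]],
    "aleatoric": [0.6, 0.8],
    "bone_lengths": [1.0, 1.0, 1.0],
}
loss = stage_one_objective([sample], lam=0.1)  # lam chosen arbitrarily for the example
print(loss)
```

In this toy case the bone length is constant, so the bone consistency term is zero and the total is driven by the reconstruction error plus the weighted norm of the aleatoric embedding.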