Controllable Motion Synthesis and Reconstruction with Autoregressive Diffusion Models

Wenjie Yin, Ruibo Tu, Hang Yin, Danica Kragic, Hedvig Kjellström, Mårten Björkman
2023-04-03
Abstract: Data-driven and controllable human motion synthesis and prediction are active research areas with various applications in interactive media and social robotics. Challenges remain in these fields for generating diverse motions given past observations and dealing with imperfect poses. This paper introduces MoDiff, an autoregressive probabilistic diffusion model over motion sequences conditioned on control contexts of other modalities. Our model integrates a cross-modal Transformer encoder and a Transformer-based decoder, which are found effective in capturing temporal correlations in motion and control modalities. We also introduce a new data dropout method based on the diffusion forward process to provide richer data representations and robust generation. We demonstrate the superior performance of MoDiff in controllable motion synthesis for locomotion with respect to two baselines and show the benefits of diffusion data dropout for robust synthesis and reconstruction of high-fidelity motion close to recorded data.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper addresses two problems in human motion synthesis and prediction: generating diverse, controllable motions given past observations, and handling imperfect poses. The authors propose MoDiff, an autoregressive probabilistic diffusion model over motion sequences conditioned on control contexts of other modalities. Deterministic models perform poorly at generating diverse motions, while probabilistic models, although more diverse, still face challenges with long-term generation and imperfect data. MoDiff combines a cross-modal Transformer encoder with a Transformer-based decoder to capture temporal correlations in the motion and control modalities, and adds a data dropout method based on the diffusion forward process to provide richer data representations and more robust generation. The paper reports that MoDiff outperforms two baselines on controllable locomotion synthesis, and that diffusion data dropout benefits robust synthesis and reconstruction of high-fidelity motion close to recorded data.

Summary:
- **Problem**: Generating diverse and controllable human motions, and handling imperfect poses.
- **Method**: MoDiff, a framework combining an autoregressive diffusion model with a cross-modal Transformer encoder and a Transformer-based decoder.
- **Innovation**: A data dropout method, based on the diffusion forward process, to enhance model robustness and diversity.
- **Applications**: Fields such as interactive media and social robotics.
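To make the idea of a data dropout method built on the diffusion forward process more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the summary does not say how MoDiff picks noise levels or which frames it corrupts, so the per-frame random step, the `max_step` cap, the linear beta schedule, and all function names below are assumptions made for illustration. The only part taken as standard is the closed-form forward process of a denoising diffusion model, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (a common default, assumed here)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def forward_diffuse(frame, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) for one pose frame (a list of floats)."""
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    return [signal * x + sigma * rng.gauss(0.0, 1.0) for x in frame]


def diffusion_data_dropout(past_frames, alpha_bar, rng, max_step):
    """Corrupt each past frame at a random diffusion step below max_step.

    Hypothetical scheme: one independent step per frame.
    """
    steps = [rng.randrange(max_step) for _ in past_frames]
    noisy = [forward_diffuse(f, t, alpha_bar, rng)
             for f, t in zip(past_frames, steps)]
    return noisy, steps


rng = random.Random(0)
alpha_bar = cumulative_alpha(linear_beta_schedule(1000))
# Four past frames with six pose features each (toy sizes, not from the paper).
past_frames = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]
noisy_frames, steps = diffusion_data_dropout(past_frames, alpha_bar, rng, max_step=200)
```

In a training loop, the corrupted `noisy_frames` would stand in for the clean past poses fed to the model as conditioning, so that the model learns to cope with imperfect observations. Capping the step at `max_step` keeps the corruption partial rather than replacing the frames with pure noise.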