BiPO: Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis

Seong-Eun Hong, Soobin Lim, Juyeong Hwang, Minwook Chang, Hyeongyeop Kang
2024-11-28
Abstract: Generating natural and expressive human motions from textual descriptions is challenging due to the complexity of coordinating full-body dynamics and capturing nuanced motion patterns over extended sequences that accurately reflect the given text. To address this, we introduce BiPO, Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis, a novel model that enhances text-to-motion synthesis by integrating part-based generation with a bidirectional autoregressive architecture. This integration allows BiPO to consider both past and future contexts during generation while enhancing detailed control over individual body parts without requiring ground-truth motion length. To relax the interdependency among body parts caused by the integration, we devise the Partial Occlusion technique, which probabilistically occludes certain motion part information during training. In our comprehensive experiments, BiPO achieves state-of-the-art performance on the HumanML3D dataset, outperforming recent methods such as ParCo, MoMask, and BAMM in terms of FID scores and overall motion quality. Notably, BiPO excels not only in the text-to-motion generation task but also in motion editing tasks that synthesize motion based on partially generated motion sequences and textual descriptions. These results reveal BiPO's effectiveness in advancing text-to-motion synthesis and its potential for practical applications.
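The abstract describes Partial Occlusion only as probabilistically occluding motion part information during training. The sketch below illustrates that idea under assumptions not taken from the paper: motion is stored as one feature vector per body part per timestep, each part is hidden independently with a hypothetical probability `p` for the whole sequence, and hidden features are replaced by zeros as a stand-in for whatever mask token the real model uses. The function name `partial_occlusion` is hypothetical.

```python
import random
from typing import List, Tuple

# Hypothetical layout: seq[t][k] is the feature vector of body part k at timestep t.
Sequence = List[List[List[float]]]


def partial_occlusion(
    seq: Sequence, p: float, rng: random.Random
) -> Tuple[Sequence, List[bool]]:
    """Occlude whole body parts at random (illustrative sketch, not BiPO's code).

    Assumptions: each part is occluded independently with probability p for
    the entire sequence, and occluded features are replaced by zeros as a
    stand-in for a mask token. The input sequence is not modified.
    """
    num_parts = len(seq[0])
    # One Bernoulli(p) draw per body part decides whether it is hidden.
    occluded = [rng.random() < p for _ in range(num_parts)]
    out = [
        [[0.0] * len(vec) if occluded[k] else list(vec) for k, vec in enumerate(frame)]
        for frame in seq
    ]
    return out, occluded


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy input: 4 timesteps, 6 body parts, 3-dimensional features of all ones.
    seq = [[[1.0] * 3 for _ in range(6)] for _ in range(4)]
    masked, occluded = partial_occlusion(seq, p=0.3, rng=rng)
    print(occluded)  # which parts were hidden for this training sample
```

Because a hidden part's information is unavailable for that training sample, the model cannot count on every part always being present; this is one plausible reading of how occlusion could relax the interdependency among parts, not a description of BiPO's exact mechanism.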
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
The problem this paper attempts to solve is generating natural and expressive human motion from text descriptions. This is challenging because it requires coordinating whole-body dynamics and capturing subtle motion patterns while accurately reflecting the given text, especially when generating long motion sequences. Specifically, the paper points out the following:

1. **Limitations of existing methods**:
   - Existing methods often struggle to model complex whole-body dynamics, which leads to simplified representations and a lack of fine-grained coordination between body parts.
   - Some methods model each body part independently to capture part-specific motion patterns, but they lack overall consistency.
   - Unidirectional autoregressive models (such as ParCo) improve global motion consistency, but because they generate in one direction only, they cannot draw on future context, which weakens coordination over long time spans.
   - Bidirectional models (such as MoMask and BAMM) use both past and future context, but they do not incorporate part-based generation. As a result, they offer limited control over individual body parts and usually require the ground-truth motion length as input, which is impractical in real applications.
2. **The proposed method**:
   - The paper introduces BiPO (Bidirectional Partial Occlusion Network for Text-to-Motion Synthesis), presented as the first model to combine part-based generation with a bidirectional autoregressive architecture for text-to-motion synthesis without requiring the ground-truth motion length.
   - BiPO uses the Partial Occlusion (PO) technique to relax the over-dependence between body parts, yielding more independent part representations during training.
3. **Main contributions**:
   - **Integrating part-based generation with bidirectional autoregression**: combines detailed control over individual body parts with global motion consistency, without requiring the ground-truth motion length as input.
   - **Partial Occlusion (PO) technique**: reduces the over-dependence between body parts in bidirectional models and promotes robust, independent part representations.
   - **Superior performance**: achieves better FID scores and overall motion quality than existing state-of-the-art models (such as ParCo, MoMask, and BAMM) on the HumanML3D dataset.

In summary, this paper uses the BiPO model to address the coordination and flexibility problems of existing text-to-motion methods, particularly when generating natural, complex, and long human motion sequences.
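The contrast between unidirectional models such as ParCo and bidirectional models can be illustrated with a toy function that lists which already-known tokens a prediction at position `i` may condition on. This is a generic sketch of the two conditioning patterns, not the actual attention scheme of BiPO, ParCo, MoMask, or BAMM; the name `visible_context` and the boolean `known` list are hypothetical.

```python
from typing import List


def visible_context(known: List[bool], i: int, bidirectional: bool) -> List[int]:
    """Positions of already-known tokens that a prediction at position i may use.

    Unidirectional (left-to-right) decoding only sees known tokens before i;
    bidirectional decoding also sees known tokens after i.
    """
    return [
        j
        for j, is_known in enumerate(known)
        if is_known and j != i and (bidirectional or j < i)
    ]


if __name__ == "__main__":
    # Token 2 is being predicted; tokens 0, 1, 3 and 4 are already known.
    known = [True, True, False, True, True]
    print(visible_context(known, 2, bidirectional=False))  # [0, 1]
    print(visible_context(known, 2, bidirectional=True))   # [0, 1, 3, 4]
```

In the unidirectional case, tokens 3 and 4 are ignored even though they are already known; the bidirectional case can use them, which is the kind of future context the summary says unidirectional models lack.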