Modal-Enhanced Semantic Modeling for Fine-Grained 3D Human Motion Retrieval

Haoyu Shi, Huaiwen Zhang
DOI: https://doi.org/10.1145/3664647.3681625
2024-01-01
Abstract: Text to Motion Retrieval (TMR) is an emerging task that retrieves relevant motion sequences given a natural language description. The dominant approach learns a joint embedding space to measure global-level similarities. However, simple global embeddings are insufficient to represent complicated motion and textual details, such as the movement of specific body parts and the coordination among them. In addition, most motion variations are subtle and local, which creates semantic vagueness among motions and makes it considerably harder to align motion sequences precisely with texts. To address these challenges, we propose a novel Modal-Enhanced Semantic Modeling (MESM) method, which focuses on fine-grained alignment through enhanced modal semantics. Specifically, we develop a prompt-enhanced textual module (PTM) that generates detailed descriptions of specific body part movements, capturing the fine-grained textual semantics needed for precise matching. We employ a skeleton-enhanced motion module (SMM) to strengthen the model's capability to represent intricate motions; this module uses a graph convolutional network to model the spatial dependencies among relevant body parts. To improve sensitivity to subtle motions, we further propose a text-driven semantics interaction module (TSIM). The TSIM assigns motion features to a set of aggregated descriptors and employs cross-attention to aggregate discriminative motion embeddings guided by text, enabling precise semantic alignment between subtle motions and their corresponding texts. Extensive experiments on two widely used benchmark datasets, HumanML3D and KIT-ML, demonstrate the effectiveness of the proposed method. Our approach outperforms existing state-of-the-art retrieval methods, achieving significant Rsum improvements of 24.28% on HumanML3D and 25.80% on KIT-ML.
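To make the retrieval setting and the Rsum metric concrete, here is a minimal sketch of how a joint embedding space is typically scored. It is not the MESM implementation. The function names (cosine_similarity_matrix, recall_at_k, rsum) are hypothetical, the K values (1, 5, 10) are an assumption because the abstract does not state which recall levels enter Rsum, and the sketch assumes that text i is paired with motion i.

```python
import numpy as np


def cosine_similarity_matrix(text_emb, motion_emb):
    """Return sim[i, j] = cosine similarity between text i and motion j."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    return t @ m.T


def recall_at_k(sim, k):
    """Percentage of queries whose paired item (index i) ranks in the top k."""
    true_scores = np.diag(sim)[:, None]
    # Rank of the true match = number of gallery items scoring strictly higher.
    ranks = (sim > true_scores).sum(axis=1)
    return 100.0 * float(np.mean(ranks < k))


def rsum(sim, ks=(1, 5, 10)):
    """Sum of recall@k over both retrieval directions (K values are assumed)."""
    text_to_motion = sum(recall_at_k(sim, k) for k in ks)
    motion_to_text = sum(recall_at_k(sim.T, k) for k in ks)
    return text_to_motion + motion_to_text


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text_emb = rng.normal(size=(8, 16))
    # Toy data: motion embeddings are noisy copies of the text embeddings.
    motion_emb = text_emb + 0.1 * rng.normal(size=(8, 16))
    sim = cosine_similarity_matrix(text_emb, motion_emb)
    print("R@1 (text to motion):", recall_at_k(sim, 1))
    print("Rsum:", rsum(sim))
```

With perfectly aligned embeddings every query ranks its pair first, so each recall value is 100 and Rsum over three K values in two directions reaches its maximum of 600. A global score of this kind compares one vector per text with one vector per motion, which is the limitation the abstract attributes to global-level similarity.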