UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

Yiheng Li, Ruibing Hou, Hong Chang, Shiguang Shan, Xilin Chen
2024-11-25
Abstract: Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance the fine-grained pose perception capabilities, we facilitate UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is that human pose comprehension, generation, and editing are mostly handled in isolation, with no unified multimodal framework that covers all three. Specifically:

1. **Limitations of Existing Work**:
   - Current methods usually support only a single modality of control signal (images, text, or 3D SMPL poses) and treat each task separately, which limits their use in real-world scenarios.
   - Existing multimodal large language models (MLLMs) still fall short when dealing with human poses, especially in fine-grained pose perception and complex relationship understanding.
2. **Goals of UniPose**:
   - Propose a unified multimodal framework that handles human pose comprehension, generation, and editing together.
   - Use large language models (LLMs) and a mixture of visual encoders to improve the handling of different modalities, including images, text, and 3D SMPL poses.
   - Transfer knowledge across tasks, so that the model adapts to unseen tasks and exhibits extended capabilities.

### Main Challenges

1. **Unified Representation Space**:
   - 3D poses and text need to be processed within the same vocabulary. Existing methods usually encode 3D poses as continuous features while tokenizing text into discrete sequences. This mismatch makes it harder for an LLM to model the interaction between the two.
2. **Fine-grained Pose Perception**:
   - The visual branch needs fine-grained pose perception. Most existing MLLMs rely on CLIP's visual encoder, but CLIP's global supervision makes it difficult to capture detailed pixel-level information such as keypoints and parsing maps.

### Solutions

To address these challenges, the authors propose UniPose, a unified multimodal framework with the following main components:

1. **Pose Tokenizer**:
   - Quantize the original 3D pose (represented by SMPL parameters) into a sequence of discrete tokens, so that 3D poses and text can be processed within the same vocabulary.
   - Use a VQ-VAE (Vector Quantized Variational Autoencoder) to obtain the discrete pose representation.
2. **Visual Processor**:
   - Adopt a mixture of visual encoders that combines CLIP's original visual encoder with a pre-trained pose-specific Vision Transformer (Pose-ViT) to better capture pose-related features.
3. **Mixed Attention Mechanism**:
   - Apply causal attention to text tokens and bidirectional attention to pose tokens, reflecting the spatial rather than sequential nature of pose tokens.

With these components, UniPose performs well on a variety of pose-related tasks and shows zero-shot generalization, for example to text-enhanced pose estimation.

### Summary

This paper aims to build a general-purpose multimodal framework that handles human pose comprehension, generation, and editing together, thereby supporting applications of human pose research in fields such as virtual reality and healthcare.
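
To make the pose tokenizer and the mixed attention mechanism more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that quantization is a nearest-neighbour lookup in a learned codebook (the standard VQ-VAE step), that pose tokens are written into the vocabulary as strings of the hypothetical form `<pose_k>`, and that bidirectional attention applies only among pose tokens in the same contiguous span, with every other pair following the causal rule. The function names, the toy codebook, and the feature values are all illustrative.

```python
from typing import List


def quantize_pose(features: List[List[float]],
                  codebook: List[List[float]]) -> List[int]:
    """Map each continuous feature vector to the index of its nearest
    codebook entry (squared Euclidean distance), as in a VQ-VAE lookup."""
    tokens = []
    for vec in features:
        dists = [sum((a - b) ** 2 for a, b in zip(vec, code))
                 for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def to_vocab(tokens: List[int]) -> List[str]:
    """Render pose indices as vocabulary strings.
    The `<pose_k>` naming is a hypothetical convention for illustration."""
    return [f"<pose_{t}>" for t in tokens]


def mixed_attention_mask(is_pose: List[bool]) -> List[List[bool]]:
    """Build a mask where mask[i][j] is True if position i may attend to j.

    Assumption: all positions follow the causal rule (j <= i), and pose
    tokens in the same contiguous span additionally see each other in
    both directions.
    """
    n = len(is_pose)

    # Give each contiguous run of pose tokens its own span id.
    span_id = [-1] * n
    current = -1
    for i, flag in enumerate(is_pose):
        if flag:
            if i == 0 or not is_pose[i - 1]:
                current += 1
            span_id[i] = current

    # Start from a causal (lower-triangular) mask.
    mask = [[j <= i for j in range(n)] for i in range(n)]

    # Open up bidirectional attention inside each pose span.
    for i in range(n):
        for j in range(n):
            if is_pose[i] and is_pose[j] and span_id[i] == span_id[j]:
                mask[i][j] = True
    return mask


# Toy example: three 2-D pose features and a three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
features = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]
tokens = quantize_pose(features, codebook)   # [0, 1, 2]
pose_words = to_vocab(tokens)                # ['<pose_0>', '<pose_1>', '<pose_2>']

# Sequence layout: text, pose, pose, text.
mask = mixed_attention_mask([False, True, True, False])
```

In the toy sequence, the first pose token can attend to the second one even though it comes later, while the leading text token still cannot see anything after itself. In a real model the features would come from the VQ-VAE encoder over SMPL parameters and the mask would be passed to the LLM's attention layers.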