Abstract: Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance the fine-grained pose perception capabilities, we facilitate UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks.

What problem does this paper attempt to address?

The main problem this paper addresses is that most existing work on human pose understanding, generation, and editing treats these tasks in isolation, with no unified multimodal framework that handles them together. Specifically:

1. **Limitations of Existing Work**:
   - Current methods usually support only a single modality of control signal (such as images, text, or 3D SMPL poses), and each task is handled in isolation, which limits their use in real-world scenarios.
   - Existing multimodal large language models (MLLMs) still fall short on human poses, especially in fine-grained pose perception and in understanding complex relationships.
2. **Goals of UniPose**:
   - Propose a unified multimodal framework that handles human pose understanding, generation, and editing together.
   - Use large language models (LLMs) and a mixture of visual encoders to improve the processing of different modalities, including images, text, and 3D SMPL poses.
   - Transfer knowledge across tasks, so that the model performs well on unseen tasks and can be extended to new ones.

### Main Challenges

1. **Unified Representation Space**:
   - Create a unified representation space so that 3D poses and text can be processed in the same vocabulary. Existing methods usually encode 3D poses as continuous features while tokenizing text into discrete sequences. This mismatch makes it harder for an LLM to model the interaction between the two.
2. **Fine-grained Pose Perception**:
   - Achieve fine-grained pose perception in the visual branch. Most existing MLLMs rely on CLIP's visual encoder, but CLIP's global supervision makes it hard to capture detailed pixel-level information, such as keypoints and parsing maps.

### Solutions

To address these challenges, the authors propose UniPose, a unified multimodal framework with the following main components:

1. **Pose Tokenizer**:
   - Quantize the original 3D pose (represented by SMPL parameters) into a discrete token sequence, so that 3D poses and text can be processed in the same vocabulary.
   - Use a VQ-VAE (Vector Quantized Variational Autoencoder) to obtain the discrete pose representation.
2. **Visual Processor**:
   - Adopt a hybrid visual encoder that combines CLIP's original visual encoder with a pre-trained pose-specific vision transformer (Pose-ViT) to better capture pose-related features.
3. **Mixed Attention Mechanism**:
   - Apply causal attention to text tokens and bidirectional attention to pose tokens, which suits the spatial (rather than sequential) nature of pose tokens.

With these components, UniPose performs well on a variety of pose-related tasks and shows zero-shot generalization, for example to text-enhanced pose estimation.

### Summary

This paper aims to build a general-purpose multimodal framework that handles human pose understanding, generation, and editing together, thereby promoting the application of human pose research in fields such as virtual reality and healthcare.
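The quantization step behind the pose tokenizer can be illustrated with a minimal sketch. It shows only the nearest-codebook lookup that a VQ-VAE performs on encoder outputs; the encoder and decoder are omitted. The toy codebook, the latent values, the helper names `quantize` and `to_vocab_ids`, and the idea of offsetting pose indices past a text vocabulary of size 32000 are all illustrative assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook entry.

    Distance is squared Euclidean, as in the standard VQ-VAE lookup.
    """
    tokens = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        tokens.append(min(range(len(codebook)), key=dists.__getitem__))
    return tokens


def to_vocab_ids(tokens: Sequence[int], text_vocab_size: int) -> List[int]:
    """Shift pose-token indices so they sit after the text vocabulary.

    This is one hypothetical way to place pose and text tokens in a
    single shared vocabulary.
    """
    return [text_vocab_size + t for t in tokens]


# Toy 2-D codebook with four entries (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Pretend these came from a pose encoder applied to SMPL parameters.
latents = [[0.1, -0.1], [0.9, 0.8], [0.2, 0.7]]

pose_tokens = quantize(latents, codebook)
pose_ids = to_vocab_ids(pose_tokens, text_vocab_size=32000)
print(pose_tokens, pose_ids)
```

In a setup like this, the LLM would see the shifted indices as ordinary vocabulary entries, which is what lets pose and text share one token sequence.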
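The mixed attention mechanism can be sketched as a boolean attention mask. The summary only states that text tokens use causal attention and pose tokens use bidirectional attention; the exact layout below is an assumption. In it, every token attends causally to earlier positions, and pose tokens additionally attend to later tokens within the same contiguous pose span. The function name `mixed_attention_mask` is hypothetical.

```python
from typing import List, Sequence


def mixed_attention_mask(is_pose: Sequence[bool]) -> List[List[bool]]:
    """Build a mask where mask[q][k] is True if query q may attend to key k.

    Text tokens: causal (k <= q).
    Pose tokens: causal, plus full visibility inside their own pose span.
    """
    n = len(is_pose)

    # Label each contiguous run of pose tokens with a span id (-1 for text).
    span_id = [-1] * n
    current = -1
    for i, flag in enumerate(is_pose):
        if flag:
            if i == 0 or not is_pose[i - 1]:
                current += 1
            span_id[i] = current

    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            causal = k <= q
            same_span = is_pose[q] and is_pose[k] and span_id[q] == span_id[k]
            mask[q][k] = causal or same_span
    return mask


# Two text tokens, three pose tokens, one trailing text token.
layout = [False, False, True, True, True, False]
mask = mixed_attention_mask(layout)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed grid shows a lower-triangular pattern for the text rows and a fully filled square block where the three pose tokens see each other in both directions.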

UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

ChatPose: Chatting about 3D Human Pose

UniHuman: A Unified Model for Editing Human Images in the Wild

UniHCP: A Unified Model for Human-Centric Perceptions

PoseVocab: Learning Joint-structured Pose Embeddings for Human Avatar Modeling

PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation

Uni3D-LLM: Unifying Point Cloud Perception, Generation and Editing with Large Language Models

LiCamPose: Combining Multi-View LiDAR and RGB Cameras for Robust Single-frame 3D Human Pose Estimation

MPM: A Unified 2D-3D Human Pose Representation via Masked Pose Modeling

Uni3DL: Unified Model for 3D and Language Understanding

Human Pose as Compositional Tokens

VirtualPose: Learning Generalizable 3D Human Pose Models from Virtual Data

Unimotion: Unifying 3D Human Motion Synthesis and Understanding

LAMP: Leveraging Language Prompts for Multi-person Pose Estimation

Keypoints-Integrated Instruction-Following Data Generation for Enhanced Human Pose Understanding in Multimodal Models

UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

GatedUniPose: A Novel Approach for Pose Estimation Combining UniRepLKNet and Gated Convolution

PoseScript: Linking 3D Human Poses and Natural Language

MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation

SD-Pose: facilitating space-decoupled human pose estimation via adaptive pose perception guidance

Learning Pose Grammar for Monocular 3D Pose Estimation