Abstract: 3D pose transfer over unorganized point clouds is a challenging generation task that transfers a source’s pose to a target shape while keeping the target’s identity. Recent deep models learn deformations and use the target’s identity as a style to modulate either the combined features of the two shapes or the aligned vertices of the source shape. However, all operations in these models are point-wise and independent, and they ignore the geometric information on the surface and structure of the input shapes. This disadvantage severely limits their generation and generalization capabilities. In this study, we propose a geometry-aware method based on a novel transformer autoencoder to solve this problem. An efficient self-attention mechanism, cross-covariance attention, is used throughout our framework to perceive the correlations between points at different distances. Specifically, the transformer encoder extracts the target shape’s local geometry details for identity attributes and the source shape’s global geometry structure for pose information. Our transformer decoder efficiently learns deformations and recovers identity properties by fusing and decoding the extracted features in a geometry-attentional manner, which requires neither correspondence information nor modulation steps. The experiments demonstrate that the geometry-aware method achieves state-of-the-art performance on the 3D pose transfer task. The implementation code and data are available at https://github.com/SEULSH/Geometry-Aware-3D-Pose-Transfer-Using-Transformer-Autoencoder .

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **when performing 3D pose transfer on unordered point clouds, how can a model preserve the identity characteristics of the target shape while effectively capturing and using the geometric information of the input shapes?** Specifically, existing deep-learning methods for 3D pose transfer usually ignore the geometric information of the input shapes (such as local surface details and global structure) and map point features independently in the encoder, which results in poor-quality generated shapes. In addition, these methods often require complex frameworks that adjust the latent features in the decoder to maintain the identity characteristics of the target shape. To solve these problems, the authors propose a new method based on a transformer autoencoder, which is able to:

1. **Effectively capture the geometric information of the input shapes**: By introducing the Cross-Covariance Attention (XCA) mechanism, the model can dynamically learn the correlations between points within different distance ranges, and thereby better capture local and global geometric information.
2. **Simplify the generation process**: Since the features extracted by the encoder already contain complete geometric information, the decoder no longer needs the target shape's information to adjust the latent features, which simplifies the generation process and improves the generation quality.
3. **Improve generation and generalization ability**: Experimental results show that the method outperforms existing deep-learning models in both generation and generalization ability.

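
The XCA mechanism described in item 1 can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation: it builds a d×d attention map from the L2-normalized key and query matrices, scales it by a temperature `tau`, and applies it to the values. The helper names, the choice to normalize each feature channel across the N points, and the choice to apply the softmax over each column of the map are assumptions (they follow the usual cross-covariance attention convention), and `tau` is a fixed number here rather than a learned parameter.

```python
import math

def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def l2_normalize_columns(m, eps=1e-12):
    """Scale each column (one feature channel over all N points) to unit L2 norm.
    Assumption: normalization runs over the point dimension."""
    normed = []
    for col in transpose(m):
        norm = max(math.sqrt(sum(c * c for c in col)), eps)
        normed.append([c / norm for c in col])
    return transpose(normed)

def softmax_columns(m):
    """Softmax over each column, so every column sums to 1.
    Assumption: this is the softmax axis used in the attention map."""
    out = []
    for col in transpose(m):
        peak = max(col)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in col]
        total = sum(exps)
        out.append([e / total for e in exps])
    return transpose(out)

def xc_attention_map(q, k, tau=1.0):
    """Return the d x d map Softmax(K_hat^T Q_hat / tau)."""
    scores = matmul(transpose(l2_normalize_columns(k)), l2_normalize_columns(q))
    return softmax_columns([[s / tau for s in row] for row in scores])

def xc_attention(q, k, v, tau=1.0):
    """Return V @ A_XC(K, Q), an N x d matrix."""
    return matmul(v, xc_attention_map(q, k, tau))

# Toy example: N = 3 points, d = 2 feature channels (values are made up).
q = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
k = [[0.5, 1.0], [1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = xc_attention(q, k, v, tau=0.5)
print(len(out), len(out[0]))  # N rows, d columns
```

Because the attention map has size d×d rather than N×N, its cost grows linearly with the number of points, and the map itself does not change when the points are reordered, which suits unordered point clouds.
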
### Formula Summary

- **XCA Attention Mechanism**:
  \[ \text{XC-Attention}(Q, K, V) = V A_{XC}(K, Q) = V\,\text{Softmax}\left(\frac{\tilde{K}^T \tilde{Q}}{\tau}\right) \]
  where \(Q\in\mathbb{R}^{N\times d}\), \(K\in\mathbb{R}^{N\times d}\), and \(V\in\mathbb{R}^{N\times d}\) are the query, key, and value matrices respectively, \(\tilde{K}\) and \(\tilde{Q}\) are the key and query matrices after \(l_2\) normalization, and \(\tau\) is a learnable temperature parameter.
- **Reconstruction Loss**:
  \[ L_{\text{rec}} = \| V_{\text{gt}} - V_{\text{generate}} \|_2^2 \]
  where \(V_{\text{gt}}\in\mathbb{R}^{N\times 3}\) and \(V_{\text{generate}}\in\mathbb{R}^{N\times 3}\) are the vertex coordinates of the real shape and the generated shape respectively.
- **Edge Loss**:
  \[ L_{\text{edg}} = \sum_{v}\sum_{p\in N(v)} \| v - p \|_2^2 \]
  where \(N(v)\) is the set of neighbors of point \(v\).
- **L2 Regularization Loss**:
  \[ L_{L2} = \frac{1}{2n}\sum_{w} w^2 \]
  where \(n\) is the total number of weight parameters in the model.
- **Total Loss Function**:
  \[ L = \lambda_{\text{rec}} L_{\text{rec}} + \lambda_{\text{edg}} L_{\text{edg}} + \lambda_{L2} L_{L2} \]
  where \(\lambda_{\text{rec}}\), \(\lambda_{\text{edg}}\), and \(\lambda_{L2}\) weight the three terms.

Through these improvements, the method achieves better results on 3D pose transfer tasks, with higher generation quality and generalization ability.
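
To make the loss terms above concrete, here is a small sketch that evaluates each formula literally on toy data. It is only an illustration under stated assumptions: the function names, the neighbor-list representation of \(N(v)\), the choice to evaluate the edge term on the generated vertices, and the example \(\lambda\) values are all hypothetical and are not taken from the paper or its code.

```python
def reconstruction_loss(v_gt, v_gen):
    """L_rec: sum of squared coordinate differences between two N x 3 vertex lists."""
    return sum((a - b) ** 2 for p, g in zip(v_gt, v_gen) for a, b in zip(p, g))

def edge_loss(vertices, neighbors):
    """L_edg as written above: squared distance from each vertex to each of its neighbors.
    neighbors[i] lists the indices of the neighbors of vertex i (hypothetical input format)."""
    total = 0.0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            total += sum((a - b) ** 2 for a, b in zip(vertices[i], vertices[j]))
    return total

def l2_regularization(weights):
    """L_L2: sum of squared weights divided by 2n, where n is the number of weights."""
    return sum(w * w for w in weights) / (2 * len(weights))

def total_loss(v_gt, v_gen, neighbors, weights, lam_rec, lam_edg, lam_l2):
    """Weighted sum of the three terms; the lambda values are supplied by the caller."""
    return (lam_rec * reconstruction_loss(v_gt, v_gen)
            + lam_edg * edge_loss(v_gen, neighbors)  # assumption: edge term uses generated vertices
            + lam_l2 * l2_regularization(weights))

# Toy data: a triangle whose third generated vertex is lifted by 1 along z.
v_gt = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
v_gen = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
neighbors = [[1, 2], [0, 2], [0, 1]]
weights = [1.0, -1.0, 2.0, 0.0]

# Placeholder lambda values chosen only for this example.
loss = total_loss(v_gt, v_gen, neighbors, weights, lam_rec=1.0, lam_edg=0.5, lam_l2=0.1)
print(loss)
```

In this toy case the reconstruction term is 1, the edge term is 12 (each of the three edges is counted once from each endpoint), and the regularization term is 0.75, so the weighted total is approximately 7.075.
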

Geometry-aware 3D pose transfer using transformer autoencoder
