Abstract: 3D pose transfer over unorganized point clouds is a challenging generation task, which transfers a source’s pose to a target shape and keeps the target’s identity. Recent deep models have learned deformations and used the target’s identity as a style to modulate the combined features of two shapes or the aligned vertices of the source shape. However, all operations in these models are point-wise and independent and ignore the geometric information on the surface and structure of the input shapes. This disadvantage severely limits the generation and generalization capabilities. In this study, we propose a geometry-aware method based on a novel transformer autoencoder to solve this problem. An efficient self-attention mechanism, that is, cross-covariance attention, was utilized across our framework to perceive the correlations between points at different distances. Specifically, the transformer encoder extracts the target shape’s local geometry details for identity attributes and the source shape’s global geometry structure for pose information. Our transformer decoder efficiently learns deformations and recovers identity properties by fusing and decoding the extracted features in a geometry attentional manner, which does not require corresponding information or modulation steps. The experiments demonstrated that the geometry-aware method achieved state-of-the-art performance in a 3D pose transfer task. The implementation code and data are available at https://github.com/SEULSH/Geometry-Aware-3D-Pose-Transfer-Using-Transformer-Autoencoder.
### What problem does this paper attempt to address?
This paper addresses the following question: **when transferring 3D pose between unordered point clouds, how can a model preserve the identity of the target shape while still capturing and using the geometric information of the input shapes?**
Existing deep-learning methods for 3D pose transfer usually ignore the geometric information of the input shapes, such as local surface details and global structure. Their encoders map each point's features independently, which leads to poor-quality generated shapes. These methods also tend to need complex decoder frameworks that adjust (modulate) the latent features in order to keep the identity of the target shape.
To address these problems, the authors propose a method based on a transformer autoencoder that aims to:
1. **Capture the geometric information of the input shapes**: Using the cross-covariance attention (XCA) mechanism, the model learns correlations between points at different distance ranges, which lets it capture both local and global geometric information.
2. **Simplify the generation process**: Because the features extracted by the encoder already carry the geometric information, the decoder no longer needs the target shape to modulate the latent features. This simplifies generation and improves its quality.
3. **Improve generation and generalization ability**: The reported experiments show the method outperforming existing deep-learning models in both generation quality and generalization.
### Formula Summary
- **XCA Attention Mechanism**:
\[
\text{XC-Attention}(Q, K, V) = V A_{XC}(K, Q) = V\,\text{Softmax}\left(\frac{\tilde{K}^T \tilde{Q}}{\tau}\right)
\]
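where \(Q\in\mathbb{R}^{N\times d}\), \(K\in\mathbb{R}^{N\times d}\), and \(V\in\mathbb{R}^{N\times d}\) are the query, key, and value matrices, respectively; \(\tilde{K}\) and \(\tilde{Q}\) are the \(\ell_2\)-normalized key and query matrices; and \(\tau\) is a learnable temperature parameter. Because \(\tilde{K}^T \tilde{Q}\) is a \(d\times d\) matrix, the attention map relates feature channels rather than pairs of points, so its cost grows linearly with the number of points \(N\).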
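To make the shapes in the XCA formula concrete, here is a minimal single-head sketch in plain Python, using nested lists to avoid any dependencies. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, \(\ell_2\) normalization is assumed to act on each feature channel (column) across the \(N\) points, and the softmax is assumed to normalize each column of the \(d\times d\) map.

```python
import math


def l2_normalize_columns(M):
    """Scale each column (one feature channel over all N points) to unit L2 norm."""
    n_rows, n_cols = len(M), len(M[0])
    # "or 1.0" leaves an all-zero column unchanged instead of dividing by zero.
    norms = [math.sqrt(sum(M[r][c] ** 2 for r in range(n_rows))) or 1.0
             for c in range(n_cols)]
    return [[M[r][c] / norms[c] for c in range(n_cols)] for r in range(n_rows)]


def xc_attention(Q, K, V, tau=1.0):
    """Compute V @ softmax(K_hat^T Q_hat / tau) for N x d inputs (nested lists)."""
    n, d = len(Q), len(Q[0])
    Qn, Kn = l2_normalize_columns(Q), l2_normalize_columns(K)
    # d x d logits: entry (j, i) compares key channel j with query channel i.
    logits = [[sum(Kn[r][j] * Qn[r][i] for r in range(n)) / tau
               for i in range(d)] for j in range(d)]
    # Assumption: softmax over j (down each column), so every output channel
    # is a convex combination of the value channels.
    A = [[0.0] * d for _ in range(d)]
    for i in range(d):
        column = [logits[j][i] for j in range(d)]
        peak = max(column)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in column]
        total = sum(exps)
        for j in range(d):
            A[j][i] = exps[j] / total
    # The output keeps the N x d shape of V.
    return [[sum(V[r][j] * A[j][i] for j in range(d)) for i in range(d)]
            for r in range(n)]


# Toy input: N = 3 points, d = 2 channels.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[0.5, 1.0], [1.0, 0.0], [0.0, 2.0]]
V = [[0.2, 0.8], [0.4, 0.6], [1.0, 0.0]]
out = xc_attention(Q, K, V, tau=0.5)
```

The attention map here is only \(d\times d\), regardless of how many points the shape has, which is what makes this form of attention practical for dense point clouds.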
- **Reconstruction Loss**:
\[
L_{\text{rec}}=\| V_{\text{gt}}-V_{\text{generate}}\|_2^2
\]
where \(V_{\text{gt}}\in\mathbb{R}^{N\times 3}\) and \(V_{\text{generate}}\in\mathbb{R}^{N\times 3}\) are the vertex coordinates of the ground-truth shape and the generated shape, respectively.
- **Edge Loss**:
\[
L_{\text{edg}}=\sum_v\sum_{p\in N(v)}\| v - p\|_2^2
\]
where \(N(v)\) is the set of neighbors of point \(v\).
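As a rough illustration, the edge loss can be computed by taking the formula above literally: sum the squared distance from every vertex to each of its neighbors. The function name and the adjacency representation (a dictionary from vertex index to neighbor indices) are assumptions made for this sketch; the source does not say how neighborhoods are stored.

```python
def edge_loss(vertices, neighbors):
    """Sum of squared distances between each vertex and its neighbors.

    vertices:  list of (x, y, z) tuples for the generated shape.
    neighbors: dict mapping a vertex index to a list of neighbor indices
               (hypothetical representation of N(v)).
    """
    total = 0.0
    for v_idx, neighbor_ids in neighbors.items():
        v = vertices[v_idx]
        for p_idx in neighbor_ids:
            p = vertices[p_idx]
            total += sum((a - b) ** 2 for a, b in zip(v, p))
    return total


# A single triangle: every vertex neighbors the other two.
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(edge_loss(triangle, adjacency))  # 8.0
```

With a symmetric adjacency, each edge contributes twice (once from each endpoint), which is why the triangle's squared edge lengths 1, 1, and 2 sum to 8 rather than 4.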
- **L2 Regularization Loss**:
\[
L_{L2}=\frac{1}{2n}\sum_w w^2
\]
where the sum runs over all weight parameters \(w\) of the model and \(n\) is their total number.
- **Total Loss Function**:
\[
L = \lambda_{\text{rec}}L_{\text{rec}}+\lambda_{\text{edg}}L_{\text{edg}}+\lambda_{L2}L_{L2}
\]
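Here \(\lambda_{\text{rec}}\), \(\lambda_{\text{edg}}\), and \(\lambda_{L2}\) weight the three terms. The sketch below shows how the reconstruction term, the L2 regularization term, and the weighted sum fit together in plain Python. The function names and the weight values in the example are placeholders chosen for illustration; the source does not state the values the authors used.

```python
def reconstruction_loss(v_gt, v_generate):
    """Squared L2 distance between ground-truth and generated vertices (N x 3)."""
    return sum((a - b) ** 2
               for p, q in zip(v_gt, v_generate)
               for a, b in zip(p, q))


def l2_regularization(weights):
    """Sum of squared weights divided by 2n, where n is the number of weights."""
    return sum(w * w for w in weights) / (2 * len(weights))


def total_loss(l_rec, l_edg, l_l2, lam_rec=1.0, lam_edg=1.0, lam_l2=1.0):
    """Weighted sum of the three terms; the default lambdas are placeholders."""
    return lam_rec * l_rec + lam_edg * l_edg + lam_l2 * l_l2


# Toy values: two vertices and four model weights.
v_gt = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
v_generate = [(0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
l_rec = reconstruction_loss(v_gt, v_generate)     # one coordinate differs by 1
l_l2 = l2_regularization([1.0, -1.0, 2.0, 0.0])   # (1 + 1 + 4 + 0) / 8
loss = total_loss(l_rec, l_edg=8.0, l_l2=l_l2, lam_rec=1.0, lam_edg=0.5, lam_l2=0.1)
```

Keeping the weights as explicit arguments makes it easy to see how strongly the edge and regularization terms pull against the reconstruction term for a given setting.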
Together, the geometry-aware encoder and the modulation-free decoder are reported to give higher generation quality and better generalization on 3D pose transfer tasks.