Motif Guided Graph Transformer with Combinatorial Skeleton Prototype Learning for Skeleton-Based Person Re-Identification

Haocong Rao, Chunyan Miao
2024-12-12
Abstract: Person re-identification (re-ID) via 3D skeleton data is a challenging task with significant value in many scenarios. Existing skeleton-based methods typically assume virtual motion relations between all joints, and adopt average joint or sequence representations for learning. However, they rarely explore key body structure and motion such as gait to focus on more important body joints or limbs, while lacking the ability to fully mine valuable spatial-temporal sub-patterns of skeletons to enhance model learning. This paper presents a generic Motif guided graph transformer with Combinatorial skeleton prototype learning (MoCos) that exploits structure-specific and gait-related body relations as well as combinatorial features of skeleton graphs to learn effective skeleton representations for person re-ID. In particular, motivated by the locality within joints' structure and the body-component collaboration in gait, we first propose the motif guided graph transformer (MGT) that incorporates hierarchical structural motifs and gait collaborative motifs, which simultaneously focuses on multi-order local joint correlations and key cooperative body parts to enhance skeleton relation learning. Then, we devise the combinatorial skeleton prototype learning (CSP) that leverages random spatial-temporal combinations of joint nodes and skeleton graphs to generate diverse sub-skeleton and sub-tracklet representations, which are contrasted with the most representative features (prototypes) of each identity to learn class-related semantics and discriminative skeleton representations. Extensive experiments validate the superior performance of MoCos over existing state-of-the-art models. We further show its generality under RGB-estimated skeletons, different graph modeling, and unsupervised scenarios.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses person re-identification (re-ID) based on 3D skeleton data. Specifically, the authors point out two shortcomings of existing methods:

1. **Assumed virtual motion relations**: Existing methods typically assume virtual motion relations between all joints and use average joint or sequence representations for learning, but rarely explore key body structure and motion (such as gait) to focus on more important body joints or limbs.
2. **Insufficient mining of spatial-temporal sub-patterns**: Existing methods cannot fully mine the valuable spatial-temporal sub-patterns of skeletons, which limits model learning.

To address these problems, the authors propose **Motif guided graph transformer with Combinatorial skeleton prototype learning (MoCos)**. The method improves skeleton-based person re-ID in the following ways:

### Method overview

#### 1. Motif Guided Graph Transformer (MGT)

MGT uses hierarchical structural motifs and gait collaborative motifs to guide the learning of joint relations, thereby capturing richer skeletal patterns. Specifically:

- **Hierarchical Structural Motif (HSM)**: By defining neighbor relations of different orders, it captures multi-order local dependencies between joints.
- **Gait Collaborative Motif (GCM)**: It focuses on the local and global motion relations between upper-limb and lower-limb joints to enhance the learning of gait features.

The HSM matrix is defined as:

\[
A_m(i,j) = \begin{cases} 1 & \text{if } j \in \bigcup_{k=1}^{m} N_k(i) \\ 0 & \text{otherwise} \end{cases}
\]

where \( m \in \{1,2,3\} \), \( A_m \in \mathbb{R}^{J \times J} \) is the \( m \)-order HSM matrix, and \( N_k(i) \) is the set of indices of the \( k \)-order neighbors of the \( i \)-th joint node.
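As a rough illustration of the HSM definition, the sketch below builds \( A_m \) for a toy skeleton. It rests on assumptions that the summary does not state: a \( k \)-order neighbor is taken to be a joint whose shortest-path distance along the bones is exactly \( k \), so a joint is never its own neighbor. The 5-joint chain `TOY_EDGES` and the helper names `hop_distances` and `hsm_matrix` are hypothetical and are not taken from the paper.

```python
from collections import deque

import numpy as np


def hop_distances(edges, num_joints):
    """Return dist[i, j] = number of bones on the shortest path from i to j.

    Unreachable pairs keep the value -1.
    """
    adj = [[] for _ in range(num_joints)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)

    dist = np.full((num_joints, num_joints), -1, dtype=int)
    for src in range(num_joints):
        dist[src, src] = 0
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for v in adj[u]:
                if dist[src, v] < 0:
                    dist[src, v] = dist[src, u] + 1
                    queue.append(v)
    return dist


def hsm_matrix(edges, num_joints, m):
    """A_m(i, j) = 1 if j lies within 1..m hops of i, else 0.

    Assumption: N_k(i) is the set of joints exactly k hops from i.
    """
    dist = hop_distances(edges, num_joints)
    return ((dist >= 1) & (dist <= m)).astype(np.float32)


# Hypothetical toy skeleton: a chain of 5 joints, 0-1-2-3-4.
TOY_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4)]
A1 = hsm_matrix(TOY_EDGES, num_joints=5, m=1)
A2 = hsm_matrix(TOY_EDGES, num_joints=5, m=2)
print(A1[0])  # joint 0 is connected only to joint 1
print(A2[0])  # joint 0 is connected to joints 1 and 2
```

Under this reading, raising \( m \) only ever adds entries to the matrix, so \( A_1 \), \( A_2 \), and \( A_3 \) form nested masks of growing locality. How the paper feeds these matrices into the transformer is not covered by this sketch.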
The GCM matrices are defined as:

\[
B_m(i,j) = \begin{cases} 1 & \text{if } i \in I_m,\ j \in \bigcup_{k=1}^{2} I_k,\ j \neq i \\ 0 & \text{otherwise} \end{cases}
\]

where \( m \in \{1,2\} \), \( B_1, B_2 \in \mathbb{R}^{J \times J} \) are the GCM matrices of the upper and lower limbs, and \( I_1 \) and \( I_2 \) are the index sets of the upper-limb and lower-limb joint nodes, respectively.

#### 2. Combinatorial Skeleton Prototype Learning (CSP)

CSP generates combinatorial sub-skeleton and sub-tracklet representations through random masking, thereby learning more representative skeleton features. The specific steps are as follows:

- **Sub-skeleton representation**: Generate sub-skeleton representations by randomly masking joint nodes.

\[
\hat{v}_t = \frac{1}{N_S} \sum_{j=1}^{J} x_j h_t^j
\]

where \( x_j \sim \text{Bernoulli}(1 - p_s) \), and \( N_S = \sum_{j=1}^{J} x_j \) is the number of unmasked nodes.

- **Sub-tracklet representation**: Generate sub-tracklet representations by randomly masking consecutive skeleton frames.

\[
V = \frac{1}{N_T} \sum_{t=1}^{f} m_t \hat{v}_t
\]

where \( m_t\sim\text{Bernoulli}(1 - p_t) \), \( N_T