Vision Transformer based Random Walk for Group Re-Identification

Guoqing Zhang, Tianqi Liu, Wenxuan Fang, Yuhui Zheng
2024-10-08
Abstract: Group re-identification (re-ID) aims to match groups containing the same people across different cameras; its main challenges are changes in group membership and in group layout. Most existing methods use the k-nearest neighbor algorithm to update node features in order to account for changes in group membership, but these methods cannot handle changes in group layout. To this end, we propose a novel vision transformer based random walk framework for group re-ID. Specifically, we design a vision transformer based on a monocular depth estimation algorithm that constructs a graph from the average depth value of pedestrian features, so as to fully consider the impact of camera distance on the relationships between group members. In addition, we propose a random walk module that reconstructs the graph by calculating affinity scores between target and gallery images, removing pedestrians who do not belong to the current group. Experimental results show that our framework is superior to most methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two major challenges in **group re-identification (group re-ID)**: **group layout change** and **group member change**. Specifically:

1. **Group layout change**: Because different cameras view the scene from different perspectives, the relative positions of the members within a group can differ significantly between views. Methods that assume a fixed layout therefore struggle to match the same group accurately.
2. **Group member change**: Members may frequently join or leave the group, which further increases the difficulty of matching.

Most existing methods use the k-nearest-neighbor algorithm to update node features in order to account for group member changes, but these methods cannot fundamentally solve the problem of group layout change.

To solve these problems, the authors propose a random walk framework based on the Vision Transformer, with two main innovations:

- **Vision Transformer based on monocular depth estimation**: By embedding the depth values of pedestrians into the Vision Transformer, a graph structure is constructed that takes into account the influence of camera distance on the relationships between group members.
- **Random walk module**: By calculating the affinity scores between the target image and the gallery images, the graph structure is reconstructed and pedestrians who do not belong to the current group are removed.

Together, these two components target the problems of group layout change and group member change. The experimental results show that the framework outperforms most existing methods on three group re-ID datasets.
### Formula display

Some of the formulas involved in the paper are as follows:

- **Random walk operation**:
  \[ y(t + 1) = W y(t) \]
  where \(y(t)\) is the vector of similarity scores between the probe image and all gallery images at the \(t\)-th random walk iteration, and \(W\) is the normalized similarity matrix.
- **Normalized similarity matrix**:
  \[ W(i, j) = \frac{\exp(S(i, j))}{\sum_{k \neq i} \exp(S(i, k))} \]
  where \(S(i, j)\) is an entry of the matrix of similarity scores between the probe sequence and the gallery images, and the sum in the denominator runs over all \(k \neq i\).
- **Attention weight calculation**:
  \[ a_{ij} = \text{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{(i, k) \in E_s} \exp(e_{ik})} \]
  where \(e_{ij}\) is the unnormalized score of edge \((i, j)\) and the sum runs over the edges \((i, k)\) in the edge set \(E_s\).

These formulas define the random walk refinement of the similarity scores and the attention weighting over graph edges used by the framework.
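
The three formulas above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`normalized_affinity`, `random_walk`, `edge_softmax`), the number of steps, and the neighbor-dictionary representation of \(E_s\) are all assumptions made for illustration. The sketch also assumes that \(S\) is a square matrix and that \(W(i, i) = 0\), which the formula suggests but the summary does not state.

```python
import numpy as np


def normalized_affinity(S):
    """W(i, j) = exp(S(i, j)) / sum_{k != i} exp(S(i, k)).

    Assumption: S is square (at least 2 x 2) and self-similarity is
    excluded, so W(i, i) = 0.
    """
    masked = np.array(S, dtype=float)            # copy, leave S untouched
    np.fill_diagonal(masked, -np.inf)            # drop the k == i terms
    masked -= masked.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(masked)                     # exp(-inf) == 0 on the diagonal
    return weights / weights.sum(axis=1, keepdims=True)


def random_walk(W, y0, steps=3):
    """Apply y(t + 1) = W y(t) for a fixed number of steps (assumed value)."""
    y = np.asarray(y0, dtype=float)
    for _ in range(steps):
        y = W @ y
    return y


def edge_softmax(e, neighbors):
    """a_ij = exp(e_ij) / sum_{(i, k) in E_s} exp(e_ik).

    Assumption: `neighbors` maps node i to the list of nodes k for which
    (i, k) is an edge; pairs that are not edges get weight 0.
    """
    e = np.asarray(e, dtype=float)
    a = np.zeros_like(e)
    for i, nbrs in neighbors.items():
        scores = e[i, nbrs]
        w = np.exp(scores - scores.max())
        a[i, nbrs] = w / w.sum()
    return a


if __name__ == "__main__":
    # Toy similarity matrix, made up for illustration only.
    S = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 3.0],
                  [2.0, 3.0, 0.0]])
    W = normalized_affinity(S)
    print(W.sum(axis=1))                     # each row sums to 1
    print(random_walk(W, [1.0, 0.0, 0.0]))   # refined scores after 3 steps
    print(edge_softmax(S, {0: [1, 2], 1: [2]}))
```

Because each row of `W` sums to 1, every step replaces a score with a weighted average of the other scores, so repeated application of the bare update \(y(t+1) = W y(t)\) smooths the scores toward a common value. The sketch therefore uses only a few steps; how the paper controls this is not described in the summary.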