Image-to-Video Re-Identification via Mutual Discriminative Knowledge Transfer

Pichao Wang, Fan Wang, Hao Li
DOI: https://doi.org/10.48550/arXiv.2201.08887
2022-01-22
Abstract: The gap in representations between image and video makes Image-to-Video Re-identification (I2V Re-ID) challenging, and recent works formulate this problem as a knowledge distillation (KD) process. In this paper, we propose a mutual discriminative knowledge distillation framework to transfer a video-based richer representation to an image-based representation more effectively. Specifically, we propose the triplet contrast loss (TCL), a novel loss designed for KD. During the KD process, the TCL loss transfers the local structure, exploits the higher order information, and mitigates the misalignment of the heterogeneous output of teacher and student networks. Compared with other losses for KD, the proposed TCL loss selectively transfers the local discriminative features from teacher to student, making it effective in the ReID. Besides the TCL loss, we adopt mutual learning to regularize both the teacher and student networks training. Extensive experiments demonstrate the effectiveness of our method on the MARS, DukeMTMC-VideoReID and VeRi-776 benchmarks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses Image-to-Video Re-identification (I2V Re-ID), a cross-modal re-identification task in which an image query is matched against video sequences. The difference between image and video representations makes the task challenging. Existing methods usually treat it as a Knowledge Distillation (KD) process, but they focus mainly on global matching and neglect local structure.

### Main problems:

1. **Representational gap**: Image and video feature representations differ significantly, so matching them directly for I2V Re-ID performs poorly.
2. **Insufficient learning of local structures**: Existing distillation methods concentrate on transferring global information and ignore local structure (such as nearest-neighbor relationships), which is particularly important for Re-ID.
3. **Heterogeneous output alignment**: The outputs of the teacher and student networks are heterogeneous, which causes misalignment when features are compared or transferred directly.

### Solutions:

To address these problems, the authors propose a new framework, **Mutual Discriminative Knowledge Transfer (MDKT)**. Its core contributions are:

1. **Triplet Contrast Loss (TCL)**:
   - TCL is a new loss function that transfers local discriminative features rather than focusing only on global matching as traditional methods do.
   - TCL encodes higher-order structural information by measuring the probability distances among the anchor, positive, and negative samples, and it alleviates the heterogeneity between the teacher and student outputs.
   - For the teacher network, the triplet probability is:

     \[ p_{apn}^{\tau_2} = \frac{\exp(-d_t^{a2p}/\tau_2)}{\exp(-d_t^{a2p}/\tau_2) + \exp(-d_t^{a2n}/\tau_2)} \]

     where \( d_t^{a2p} = \| f_t(x_a) - f_t(x_p) \|^2_2 \) and \( d_t^{a2n} = \| f_t(x_a) - f_t(x_n) \|^2_2 \) are the squared Euclidean distances from the anchor to the positive and to the negative in the teacher embedding \( f_t \), and \( \tau_2 \) is a temperature.
2. **Mutual Learning**:
   - During training, the student network learns from the teacher network, and the teacher network also receives feedback from the student network, which improves both.
   - This two-way learning regularizes the training of both networks.
3. **Multi-level distillation loss**:
   - It combines Mutual Logits Distillation, Pairwise Distance in Embedding, and the TCL loss, which work together to improve the model's generalization and discriminative ability.

### Experimental results:

The authors run extensive experiments on several benchmark datasets (MARS, DukeMTMC-VideoReID, and VeRi-776) to verify the method. The reported results show MDKT outperforming the compared methods, with notable improvements in top-1 accuracy and mAP.

In summary, the paper targets the representational gap and the insufficient learning of local structures in I2V Re-ID by introducing a new loss function and a mutual learning mechanism, with the goal of improving cross-modal re-identification performance.
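As a rough illustration of the triplet probability formula given under Triplet Contrast Loss, the sketch below computes \( p_{apn} \) for a single triplet in plain Python. The function names (`squared_l2`, `triplet_probability`, `binary_kl`) and the toy embeddings are hypothetical and do not come from the paper's code. The summary does not say how the teacher and student probabilities are compared, so the binary KL divergence used here is only an assumption made for illustration.

```python
import math


def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_probability(anchor, positive, negative, tau):
    """Probability that the anchor is closer to the positive than the negative.

    Implements p = exp(-d_ap/tau) / (exp(-d_ap/tau) + exp(-d_an/tau)),
    with d_ap and d_an the squared L2 distances from the anchor.
    """
    d_ap = squared_l2(anchor, positive)
    d_an = squared_l2(anchor, negative)
    # The ratio above equals a logistic function of the distance gap;
    # this form avoids overflow when distances are large.
    z = (d_an - d_ap) / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_kl(p, q, eps=1e-12):
    """KL divergence between two Bernoulli distributions.

    ASSUMPTION: used here only as one plausible way to compare teacher
    and student triplet probabilities; it is not taken from the paper.
    """
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


# Toy 2-D embeddings (hypothetical values) for one triplet.
teacher = {"a": [0.0, 0.0], "p": [1.0, 0.0], "n": [2.0, 0.0]}
student = {"a": [0.0, 0.0], "p": [1.0, 0.0], "n": [1.2, 0.0]}

p_teacher = triplet_probability(teacher["a"], teacher["p"], teacher["n"], tau=1.0)
p_student = triplet_probability(student["a"], student["p"], student["n"], tau=1.0)
gap = binary_kl(p_teacher, p_student)
```

In this toy case the teacher separates the positive from the negative more clearly than the student does, so `p_teacher` is larger than `p_student` and `gap` is positive. A loss of this shape would push the student's local triplet structure toward the teacher's without requiring the two embedding spaces to match coordinate by coordinate.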