Abstract: The gap in representations between image and video makes Image-to-Video Re-identification (I2V Re-ID) challenging, and recent works formulate this problem as a knowledge distillation (KD) process. In this paper, we propose a mutual discriminative knowledge distillation framework to transfer a richer video-based representation to an image-based representation more effectively. Specifically, we propose the triplet contrast loss (TCL), a novel loss designed for KD. During the KD process, the TCL loss transfers the local structure, exploits higher-order information, and mitigates the misalignment of the heterogeneous outputs of the teacher and student networks. Compared with other losses for KD, the proposed TCL loss selectively transfers the local discriminative features from teacher to student, making it effective for Re-ID. Besides the TCL loss, we adopt mutual learning to regularize the training of both the teacher and student networks. Extensive experiments demonstrate the effectiveness of our method on the MARS, DukeMTMC-VideoReID and VeRi-776 benchmarks.

What problem does this paper attempt to address?

This paper addresses Image-to-Video Re-ID (I2V Re-ID), a cross-modal re-identification task. The representational differences between images and videos make I2V Re-ID challenging. Existing methods usually treat the problem as a Knowledge Distillation (KD) process, but they mainly focus on global matching and ignore the learning of local structures.

### Main problems:

1. **Representational gap**: Image and video feature representations differ significantly, so performing I2V Re-ID directly yields poor performance.
2. **Insufficient learning of local structures**: During knowledge distillation, existing methods emphasize the transfer of global information and ignore local structures (such as nearest-neighbor relationships), which are particularly important for the Re-ID task.
3. **Heterogeneous output alignment**: The outputs of the teacher network and the student network are heterogeneous, which causes alignment problems when features are compared or transferred directly.

### Solutions:

To solve these problems, the authors propose a new framework, **Mutual Discriminative Knowledge Transfer (MDKT)**. Its core contributions are:

1. **Triplet Contrast Loss (TCL)**:
   - TCL is a new loss function that transfers local discriminative features, instead of focusing only on global matching as traditional methods do.
   - TCL encodes higher-order structural information by measuring the probability distances between the anchor, positive, and negative samples, and it alleviates the heterogeneity of the teacher and student network outputs.
   - The formula is as follows:

     \[
     p_{apn}^{\tau_2} = \frac{\exp(-d_t^{a2p}/\tau_2)}{\exp(-d_t^{a2p}/\tau_2) + \exp(-d_t^{a2n}/\tau_2)}
     \]

     where \( d_t^{a2p} = \| f_t(x_a) - f_t(x_p) \|^2_2 \) and \( d_t^{a2n} = \| f_t(x_a) - f_t(x_n) \|^2_2 \).
2. **Mutual Learning**:
   - During training, the student network learns from the teacher network, and the teacher network also receives feedback from the student network, which improves the performance of both.
   - This two-way learning mechanism helps to adjust and optimize the parameters of the two networks.
3. **Multi-level distillation loss**:
   - It combines Mutual Logits Distillation, Pairwise Distance in Embedding, and the TCL loss, which work together to improve the generalization and discriminative ability of the model.

### Experimental results:

The authors conducted extensive experiments on multiple benchmark datasets (MARS, DukeMTMC-VideoReID, and VeRi-776) to verify the effectiveness of the proposed method. The reported results show that MDKT outperforms existing methods on all metrics, with significant improvements in top-1 accuracy and mAP in particular.

In summary, this paper addresses the representational gap and the insufficient learning of local structures in I2V Re-ID by introducing a new loss function and a mutual learning mechanism, thereby significantly improving cross-modal re-identification performance.
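The triplet probability in the TCL formula above can be illustrated with a small, dependency-free sketch. It evaluates \( p_{apn}^{\tau_2} \) for one (anchor, positive, negative) triplet of embeddings. The function names `squared_l2`, `triplet_probability`, and `bernoulli_kl` are hypothetical and chosen for illustration only. The summary does not state how the teacher and student probabilities are compared, so the binary KL divergence shown here is an assumption, not the authors' confirmed loss.

```python
import math


def squared_l2(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_probability(anchor, positive, negative, tau):
    """Probability that the anchor is closer to the positive than to the negative.

    Implements exp(-d_ap/tau) / (exp(-d_ap/tau) + exp(-d_an/tau)),
    rewritten as a sigmoid of (d_an - d_ap) / tau for numerical stability.
    """
    d_ap = squared_l2(anchor, positive)
    d_an = squared_l2(anchor, negative)
    z = (d_an - d_ap) / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bernoulli_kl(p_teacher, p_student, eps=1e-12):
    """KL divergence between two Bernoulli distributions.

    ASSUMPTION: one plausible way to compare teacher and student triplet
    probabilities; the source does not specify the exact form.
    """
    p = min(max(p_teacher, eps), 1.0 - eps)
    q = min(max(p_student, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


# Toy 2-D embeddings standing in for f_t(x_a), f_t(x_p), f_t(x_n).
p_teacher = triplet_probability([0.0, 0.0], [1.0, 0.0], [2.0, 0.0], tau=1.0)
# A hypothetical student whose embeddings separate the triplet less clearly.
p_student = triplet_probability([0.0, 0.0], [1.0, 0.0], [1.5, 0.0], tau=1.0)
print(p_teacher, p_student, bernoulli_kl(p_teacher, p_student))
```

Because the probability depends only on the difference of the two distances, it is insensitive to a common offset in the teacher's and student's distance scales. A smaller temperature `tau` pushes the probability toward 0 or 1, while a larger one softens it toward 0.5.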

Image-to-Video Re-Identification via Mutual Discriminative Knowledge Transfer