Dynamic Textual Prompt For Rehearsal-free Lifelong Person Re-identification

Hongyu Chen, Bingliang Jiao, Wenxuan Wang, Peng Wang
2024-11-09
Abstract: Lifelong person re-identification attempts to recognize people across cameras and integrate new knowledge from continuous data streams. Key challenges involve addressing catastrophic forgetting caused by parameter updating and domain shift, and maintaining performance in seen and unseen domains. Many previous works rely on data memories to retain prior samples. However, the amount of retained data increases linearly with the number of training domains, leading to continually increasing memory consumption. Additionally, these methods may suffer significant performance degradation when data preservation is prohibited due to privacy concerns. To address these limitations, we propose using textual descriptions as guidance to encourage the ReID model to learn cross-domain invariant features without retaining samples. The key insight is that natural language can describe pedestrian instances with an invariant style, suggesting a shared textual space for any pedestrian images. By leveraging this shared textual space as an anchor, we can prompt the ReID model to embed images from various domains into a unified semantic space, thereby alleviating catastrophic forgetting caused by domain shifts. To achieve this, we introduce a task-driven dynamic textual prompt framework in this paper. This model features a dynamic prompt fusion module, which adaptively constructs and fuses two different textual prompts as anchors. This effectively guides the ReID model to embed images into a unified semantic space. Additionally, we design a text-visual feature alignment module to learn a more precise mapping between fine-grained visual and textual features. We also develop a learnable knowledge distillation module that allows our model to dynamically balance retaining existing knowledge with acquiring new knowledge. Extensive experiments demonstrate that our method outperforms SOTAs under various settings.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses catastrophic forgetting in **lifelong person re-identification (LReID)**. The goal of LReID is to preserve the model's performance on previously seen datasets while it keeps learning from a continuous stream of new data. Existing LReID methods usually rely on rehearsal, that is, storing samples from earlier domains, which increases memory consumption and can raise privacy concerns. To overcome these problems, the paper proposes a framework named **Dynamic Textual Prompt (DTP)**, which uses natural language descriptions as guidance to prompt the ReID model to learn cross-domain invariant features. Because the method does not store previous training samples, it avoids both the memory and the privacy problems.

The core contributions of the paper are:

1. **Introducing the dynamic textual prompt framework**: Image features from different domains are mapped into a unified semantic space, which alleviates catastrophic forgetting.
2. **Designing the dynamic prompt fusion module (DPF)**: It adaptively generates text prompts that guide the model to embed images from different domains into a unified semantic space.
3. **Introducing the text-visual feature alignment module (TFA)**: It aligns fine-grained text and visual features, giving a more accurate feature representation.
4. **Developing the learnable knowledge distillation module (LKD)**: It dynamically adjusts the distillation temperature to balance learning new knowledge against retaining existing knowledge.

With these components, the paper reports that the DTP framework outperforms existing methods on multiple datasets, especially on unseen datasets.
### Mathematical formula summary

- **Dynamic prompt generation**:
  \[ P_{DP}=\Phi(\Psi(P_{IP}), P_{PKP}) \]
  where \(P_{DP}\) represents the dynamic prompt, \(P_{IP}\) represents the invariant prompt, \(P_{PKP}\) represents the person knowledge prompt, and \(\Psi\) and \(\Phi\) represent the parameters of the encoder and decoder respectively.
- **Global loss function**:
  \[ L_{\text{global}} = L_{\text{sup}}(P_{DP}\times F_{\text{img}}^T)+L_{\text{sup}}(F_{\text{img}}\times P_{DP}^T) \]
  where \(L_{\text{sup}}\) is the modified cross-entropy loss, \(F_{\text{img}}\) is the image feature, and \(P_{DP}\) is the dynamic prompt.
- **Local loss function**:
  \[ L_{\text{partial}}=\frac{1}{N}\sum_{i = 1}^{N}\left(1-\frac{p_i\cdot f_i}{\|p_i\|_2\cdot\|f_i\|_2}\right) \]
  where \(p_i\) and \(f_i\) are the local text and image features respectively, and \(N\) is the number of body parts.
- **Learnable knowledge distillation loss**:
  \[ L_{LKD}=-\sum_i\hat{Y}'(i)_T\log\hat{Y}'(i)_{T - 1} \]
  \[ \hat{Y}'(i)_T=\frac{(y(i)_T)^{1/(t+\delta_1)}}{\sum_j(y(j)_T)^{1/(t+\delta_1)}} \]
  \[ \hat{Y}'(i)_{T - 1}=\frac{(y(i)_{T - 1})^{1/(t+\delta_2)}}{\sum_j(y(j)_{T - 1})^{1/(t+\delta_2)}} \]
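
The local alignment loss \(L_{\text{partial}}\) and the distillation loss \(L_{LKD}\) summarized above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`partial_alignment_loss`, `soften`, `lkd_loss`) are hypothetical, features are plain lists of floats, the inputs `y_new` and `y_old` are assumed to be strictly positive probability vectors, and `t`, `delta_1` and `delta_2` are passed as fixed numbers, whereas a trained model would presumably learn the two offsets.

```python
import math


def partial_alignment_loss(text_parts, image_parts):
    """Mean of (1 - cosine similarity) over N body-part feature pairs."""
    if not text_parts or len(text_parts) != len(image_parts):
        raise ValueError("need the same non-zero number of text and image parts")
    total = 0.0
    for p, f in zip(text_parts, image_parts):
        dot = sum(a * b for a, b in zip(p, f))
        norm_p = math.sqrt(sum(a * a for a in p))
        norm_f = math.sqrt(sum(b * b for b in f))
        total += 1.0 - dot / (norm_p * norm_f)
    return total / len(text_parts)


def soften(probs, temperature):
    """Raise each probability to the power 1/temperature, then renormalise."""
    powered = [y ** (1.0 / temperature) for y in probs]
    normaliser = sum(powered)
    return [v / normaliser for v in powered]


def lkd_loss(y_new, y_old, t, delta_1, delta_2):
    """Cross-entropy between the two softened distributions, following L_LKD.

    y_new / y_old: outputs indexed by T and T-1 in the formula (assumed to be
    probability vectors). delta_1 / delta_2 shift the base temperature t;
    here they are plain floats (an assumption made for this sketch).
    """
    soft_new = soften(y_new, t + delta_1)
    soft_old = soften(y_old, t + delta_2)
    return -sum(a * math.log(b) for a, b in zip(soft_new, soft_old))


if __name__ == "__main__":
    # Two body parts with 3-dimensional toy features (made-up numbers).
    text_feats = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1]]
    image_feats = [[0.9, 0.1, 0.4], [0.1, 1.0, 0.0]]
    print("L_partial:", partial_alignment_loss(text_feats, image_feats))

    # Toy class probabilities from the current and the previous model.
    print("L_LKD:", lkd_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], 2.0, 0.1, -0.1))
```

A larger effective temperature `t + delta` flattens the corresponding distribution, so separate offsets for the two terms let the sketch soften the two outputs by different amounts, which is one way to read the "dynamically adjusts the distillation temperature" description.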