Decoupled Prompt-Adapter Tuning for Continual Activity Recognition

Di Fu, Thanh Vinh Vo, Haozhe Ma, Tze-Yun Leong
2024-07-20
Abstract: Action recognition technology plays a vital role in enhancing security through surveillance systems, enabling better patient monitoring in healthcare, providing in-depth performance analysis in sports, and facilitating seamless human-AI collaboration in domains such as manufacturing and assistive technologies. The dynamic nature of data in these areas underscores the need for models that can continuously adapt to new video data without losing previously acquired knowledge, highlighting the critical role of advanced continual action recognition. To address these challenges, we propose Decoupled Prompt-Adapter Tuning (DPAT), a novel framework that integrates adapters for capturing spatial-temporal information and learnable prompts for mitigating catastrophic forgetting through a decoupled training strategy. DPAT uniquely balances the generalization benefits of prompt tuning with the plasticity provided by adapters in pretrained vision models, effectively addressing the challenge of maintaining model performance amidst continuous data evolution without necessitating extensive finetuning. DPAT consistently achieves state-of-the-art performance across several challenging action recognition benchmarks, thus demonstrating the effectiveness of our model in the domain of continual action recognition.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses catastrophic forgetting in **Continual Action Recognition**. Specifically, it focuses on how a model can retain previously learned knowledge while continually adapting to new video data, so as to achieve efficient continual learning.

### Background and Challenges

1. **Importance of Continual Learning**
   - Continual Learning (CL) refers to the ability of a model to learn new information from a continuous data stream while retaining previously learned knowledge. This is crucial for applications in dynamic environments, such as security monitoring, medical care, sports analysis, and human-machine collaboration.
2. **Limitations of Existing Methods**
   - Most existing CL methods are designed for static images and cannot effectively handle the high dimensionality, temporal dependence, and large variation between sequences that characterize video data.
   - CL methods designed specifically for video usually need to store a large amount of data to support the learning of new classes, resulting in high memory costs.
   - Adapters and prompt tuning each have limitations when used alone: adapters perform poorly in rapid task specialization, while prompt tuning adapts slowly to new tasks and is prone to homogenization.

### Proposed Method

To solve the above problems, the authors propose the **Decoupled Prompt-Adapter Tuning (DPAT)** framework. DPAT balances the generalization ability and plasticity of the model by combining the advantages of adapters and prompt tuning and adopting a phased training strategy.

### Main Contributions

1. **Introduction of the DPAT Framework**
   - The framework aims to improve the performance of pre-trained image encoders in continual action recognition, especially when dealing with spatio-temporal information.
2. **Combination of Pre-trained Vision Transformer (ViT) with Adapters and Prompt Tuning**
   - Leverages the capabilities of pre-trained models so that old knowledge is not forgotten when learning new tasks, while the model can still adapt effectively to new spatio-temporal tasks.
3. **Two-stage Training Strategy**
   - Stage 1: Establish a stable and general foundation through prefix tuning.
   - Stage 2: Perform task-specific refinement and adaptation through adapter tuning while maintaining the stability of the initial prompt.
4. **Experimental Verification**
   - On multiple challenging datasets, DPAT shows consistent state-of-the-art performance, demonstrating its strength in fine-grained action recognition tasks.

### Formula Representation

Some formulas involved in the paper are as follows:

- Definition of the prefix tuning function:
  \[ f_{\text{Pre-T}}(p, h)=\text{MSA}(h_Q, [p_k; h_K], [p_v; h_V]) \]
- Matching loss function (with softmax normalization):
  \[ L_{\text{match}}(x, k_t)=-\log\left(\frac{e^{-\gamma(q(x), k_t)/\tau}}{\sum_{i = 1}^{t}e^{-\gamma(q(x), k_i)/\tau}}\right) \]
- Training objective:
  \[ \text{Stage 1: }\min_{p_S, p_T, \phi}\mathcal{L}\left(f_\phi\left(f_{p_T, p_S, \theta_T, \theta_S}(x)\right), y\right) \]
  \[ \text{Stage 2: }\min_{\theta_T, \theta_S, k_t, \phi}\mathcal{L}\left(f_\phi\left(f_{p_T, p_S, \theta_T, \theta_S}(x)\right), y\right) \]
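
To make the prefix tuning function and the matching loss more concrete, the following is a minimal Python sketch, not the authors' implementation. It rests on several assumptions that the summary does not confirm: attention is reduced to a single head without learned projections (the paper uses multi-head self-attention, MSA), the distance \(\gamma\) is taken to be cosine distance, and the names `prefix_attention`, `cosine_distance`, and `matching_loss` are hypothetical helpers introduced only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prefix_attention(h_q, h_k, h_v, p_k, p_v):
    """Single-head attention with prefixes prepended to keys and values.

    Simplified stand-in for MSA(h_Q, [p_k; h_K], [p_v; h_V]).
    Every argument is a list of vectors (lists of floats).
    """
    keys = p_k + h_k      # [p_k; h_K]
    values = p_v + h_v    # [p_v; h_V]
    scale = math.sqrt(len(h_q[0]))
    dim_v = len(values[0])
    outputs = []
    for q in h_q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
        )
    return outputs


def cosine_distance(a, b):
    """Assumed choice for gamma: 1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def matching_loss(query, keys, t_index, tau=0.1):
    """L_match: negative log-softmax of -gamma(q(x), k_i) / tau at index t.

    `query` plays the role of q(x); `keys` holds k_1..k_t, the task keys
    seen so far; `t_index` is the 0-based position of the current key k_t.
    """
    logits = [-cosine_distance(query, k) / tau for k in keys]
    return -math.log(softmax(logits)[t_index])
```

In this sketch the prefixes only extend the key and value sequences, so the output keeps one vector per query token regardless of prefix length. The matching loss is small when the query feature is closer to the current task key than to the keys of earlier tasks, and grows when an earlier key is a better match.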