Abstract: Action recognition technology plays a vital role in enhancing security through surveillance systems, enabling better patient monitoring in healthcare, providing in-depth performance analysis in sports, and facilitating seamless human-AI collaboration in domains such as manufacturing and assistive technologies. The dynamic nature of data in these areas underscores the need for models that can continuously adapt to new video data without losing previously acquired knowledge, highlighting the critical role of advanced continual action recognition. To address these challenges, we propose Decoupled Prompt-Adapter Tuning (DPAT), a novel framework that integrates adapters for capturing spatial-temporal information and learnable prompts for mitigating catastrophic forgetting through a decoupled training strategy. DPAT uniquely balances the generalization benefits of prompt tuning with the plasticity provided by adapters in pretrained vision models, effectively addressing the challenge of maintaining model performance amidst continuous data evolution without necessitating extensive finetuning. DPAT consistently achieves state-of-the-art performance across several challenging action recognition benchmarks, thus demonstrating the effectiveness of our model in the domain of continual action recognition.

What problem does this paper attempt to address?

The paper addresses catastrophic forgetting in **Continual Action Recognition**. Specifically, it focuses on how a model can retain previously learned knowledge while continuously adapting to new video data, so that continual learning stays efficient.

### Background and Challenges

1. **Importance of Continual Learning**
   - Continual Learning (CL) refers to the ability of a model to learn new information from a continuous data stream while retaining previously learned knowledge. This is crucial for applications in dynamic environments, such as security monitoring, medical care, sports analysis, and human-machine collaboration.
2. **Limitations of Existing Methods**
   - Most existing CL methods are designed for static images and cannot effectively handle the high dimensionality, temporal dependence, and large variation between sequences found in video data.
   - CL methods built specifically for video usually need to store a large amount of data to support learning new classes, which leads to high memory costs.
   - Adapters and prompt tuning each have limitations when used alone: adapters perform poorly at rapid task specialization, while prompt tuning adapts slowly to new tasks and is prone to homogenization.

### Proposed Method

To solve the above problems, the authors propose the **Decoupled Prompt-Adapter Tuning (DPAT)** framework. DPAT balances the generalization ability and plasticity of the model by combining the advantages of adapters and prompt tuning and adopting a phased training strategy.

### Main Contributions

1. **Introduction of the DPAT Framework**
   - The framework aims to improve the performance of pre-trained image encoders in continual action recognition, especially when dealing with spatio-temporal information.
2. **Combination of Pre-trained Vision Transformer (ViT) with Adapters and Prompt Tuning**
   - It uses the capabilities of pre-trained models so that old knowledge is not forgotten when learning new tasks, while the model can still adapt to new spatio-temporal tasks.
3. **Two-stage Training Strategy**
   - Stage 1: Establish a stable and general foundation through prefix tuning.
   - Stage 2: Perform task-specific refinement and adaptation through adapter tuning while maintaining the stability of the initial prompt.
4. **Experimental Verification**
   - On multiple challenging datasets, DPAT shows consistent state-of-the-art performance, demonstrating its strength in fine-grained action recognition tasks.

### Formula Representation

Some formulas involved in the paper are as follows:

- Definition of the prefix tuning function:
  \[ f_{\text{Pre-T}}(p, h) = \text{MSA}(h_Q, [p_k; h_K], [p_v; h_V]) \]
- Matching loss function (with softmax normalization):
  \[ L_{\text{match}}(x, k_t) = -\log\left(\frac{e^{-\gamma(q(x), k_t)/\tau}}{\sum_{i=1}^{t} e^{-\gamma(q(x), k_i)/\tau}}\right) \]
- Training objective:
  \[ \text{Stage 1: } \min_{p_S, p_T, \phi} \mathcal{L}\left(f_\phi\left(f_{p_T, p_S, \theta_T, \theta_S}(x)\right), y\right) \]
  \[ \text{Stage 2: } \min_{\theta_T, \theta_S, k_t, \phi} \mathcal{L}\left(f_\phi\left(f_{p_T, p_S, \theta_T, \theta_S}(x)\right), y\right) \]
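The prefix tuning function and the matching loss above can be illustrated with a small pure-Python sketch. This is not the authors' implementation. It assumes a single attention head with identity query/key/value projections in place of the full MSA, and it assumes that the distance \(\gamma\) is cosine distance. The function names `prefix_attention`, `match_loss`, and `cosine_distance`, and the default `tau=0.1`, are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def prefix_attention(h, p_k, p_v):
    """Single-head stand-in for MSA(h_Q, [p_k; h_K], [p_v; h_V]).

    Assumption: identity projections, so h_Q = h_K = h_V = h.
    The prefixes extend only the keys and values, so the output
    has the same number of tokens as h.
    """
    keys = p_k + h      # [p_k; h_K]
    values = p_v + h    # [p_v; h_V]
    scale = math.sqrt(len(h[0]))
    out = []
    for q in h:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return out


def cosine_distance(a, b):
    """Assumed choice for gamma: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def match_loss(query, keys, tau=0.1):
    """L_match: negative log-softmax of -gamma(q, k_i) / tau.

    keys holds k_1..k_t in task order; keys[-1] is the current key k_t.
    """
    logits = [-cosine_distance(query, k) / tau for k in keys]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[-1] - log_denominator)


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0]]
    print(prefix_attention(tokens, p_k=[[0.0, 0.0]], p_v=[[5.0, 5.0]]))
    print(match_loss([1.0, 0.0], keys=[[0.0, 1.0], [1.0, 0.0]]))
```

Under these assumptions the loss is small when the query feature \(q(x)\) lies closer to the current task key \(k_t\) than to the earlier keys, and large when an earlier key is the nearest one. That is the behavior the softmax normalization over \(k_1, \dots, k_t\) is meant to encourage.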

Decoupled Prompt-Adapter Tuning for Continual Activity Recognition
