Abstract: Going beyond few-shot action recognition (FSAR), cross-domain FSAR (CDFSAR) has recently attracted research interest because it addresses the domain gap in source-to-target transfer learning. Existing CDFSAR methods mainly rely on joint training of source and target data to mitigate the side effects of the domain gap. However, such methods suffer from two limitations. First, pair-wise joint training requires retraining deep models when one source dataset is paired with multiple target datasets, which incurs heavy computation cost, especially for large source and small target data. Second, pre-trained models after joint training are applied to the target domain in a straightforward manner, which hardly exploits their full potential and thus limits recognition performance. To overcome these limitations, this paper proposes a simple yet effective baseline for CDFSAR, namely Temporal-Aware Model Tuning (TAMT). Specifically, TAMT adopts a decoupled paradigm that performs pre-training on source data and fine-tuning on target data, which avoids retraining for multiple target datasets with a single source. To effectively and efficiently explore the potential of pre-trained models when transferring to the target domain, TAMT proposes a Hierarchical Temporal Tuning Network (HTTN), whose core consists of local temporal-aware adapters (TAA) and a global temporal-aware moment tuning (GTMT). In particular, TAA learns few parameters to recalibrate the intermediate features of frozen pre-trained models, enabling efficient adaptation to target domains. Furthermore, GTMT helps generate powerful video representations, improving matching performance on the target domain. Experiments on several widely used video benchmarks show that TAMT outperforms recently proposed counterparts by 13%$\sim$31%, achieving new state-of-the-art CDFSAR results.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve two main problems in **Cross-Domain Few-Shot Action Recognition (CDFSAR)**:

1. **High computational cost and frequent retraining**: Existing CDFSAR methods usually adopt joint training, that is, training on source-domain and target-domain data simultaneously. With one source domain and multiple target domains, the model must be retrained for each target domain, which leads to high computational cost, especially when the source-domain data is large and the target-domain data is small.
2. **Failure to fully utilize the potential of pre-trained models**: In the inference stage, existing methods usually apply pre-trained models to the target domain directly, using simple nearest-neighbor classifiers or fine-tuned classifiers. This fails to exploit the full potential of pre-trained models and limits the final recognition performance.

To solve these problems, the authors propose a new method named **Temporal-Aware Model Tuning (TAMT)**, which addresses them in the following ways:

- **Decoupled training paradigm**: TAMT first pre-trains the model on the source-domain data and then fine-tunes it on the target-domain data, avoiding the high computational cost of repeated retraining.
- **Hierarchical Temporal Tuning Network (HTTN)**: To use the pre-trained model more effectively, TAMT introduces HTTN, which consists of local Temporal-Aware Adapters (TAA) and Global Temporal-Aware Moment Tuning (GTMT). TAA recalibrates the intermediate features of the frozen pre-trained model, and GTMT generates powerful video representations, thereby improving matching performance on the target domain.

### Main contributions of the paper

1. **Proposing a decoupled training paradigm**: This is the first attempt to introduce a decoupled training paradigm in the CDFSAR task, which avoids frequent retraining in the case of one source domain and multiple target domains.
2. **Designing a lightweight HTTN**: Through TAA and GTMT, HTTN significantly improves the adaptability of the pre-trained model to the target domain while remaining efficient.
3. **Experimentally verifying the effectiveness of the method**: Experiments on multiple video benchmark datasets show that TAMT significantly outperforms existing CDFSAR methods at a lower training cost.

### Presentation of formulas in Markdown format

Some of the formulas involved in the paper are as follows:

1. **Feature recalibration formula of the TAA module**:

   \[ F' = \gamma \odot F \oplus \beta \]

   where $\odot$ and $\oplus$ denote element-wise multiplication and addition, respectively.

2. **Parameter calculation formulas of the TAA module**:

   \[ \gamma = W(\bar{F}) = W_{\gamma}^{\uparrow} \ast g_{1}(W_{\gamma}^{\downarrow} \ast \bar{F}) \]

   \[ \beta = G(\bar{F}) = W_{\beta}^{\uparrow} \ast g_{2}(W_{\beta}^{\downarrow} \ast \bar{F}) \]

   where $\bar{F}$ is the global average pooling output of $F$, $\ast$ denotes the convolution operation, and $g_{1}$ and $g_{2}$ are activation functions.

3. **Representation formula of feature moments**:

   \[ Z := \Phi_{X}(u) = 1 + \sum_{p=1}^{\infty} \alpha_{p} M_{p} \]

   where $M_{p}$ denotes the $p$-th moment of $X$ and $\alpha_{p}$ is a coefficient.

4. **Calculation formulas of the first-order and second-order moments**:
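
The TAA recalibration in formulas 1 and 2 above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature map is taken to be a list of `C` channels with `T` values each, the convolutions are assumed to have kernel size 1 (so they reduce to matrix-vector products over channels), and both activations $g_1$ and $g_2$ default to ReLU. All function and parameter names are hypothetical.

```python
def relu(v):
    """Element-wise ReLU; assumed here as the activation for g1 and g2."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Matrix-vector product; stands in for a kernel-size-1 convolution."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def global_avg_pool(F):
    """Average each channel over its T positions, giving a C-dim descriptor."""
    return [sum(channel) / len(channel) for channel in F]


def bottleneck(W_down, W_up, v, act=relu):
    """Compute W_up * act(W_down * v): down-projection, activation, up-projection."""
    return matvec(W_up, act(matvec(W_down, v)))


def taa_recalibrate(F, W_gamma_down, W_gamma_up, W_beta_down, W_beta_up,
                    g1=relu, g2=relu):
    """Return F' = gamma * F + beta, with gamma and beta predicted from pooled F.

    F is a list of C channels, each a list of T floats (assumed layout).
    gamma and beta hold one value per channel and are broadcast over time.
    """
    F_bar = global_avg_pool(F)
    gamma = bottleneck(W_gamma_down, W_gamma_up, F_bar, act=g1)
    beta = bottleneck(W_beta_down, W_beta_up, F_bar, act=g2)
    return [[g * x + b for x in channel]
            for g, b, channel in zip(gamma, beta, F)]


if __name__ == "__main__":
    # Toy input: C = 2 channels, T = 2 time steps, bottleneck width 1.
    F = [[1.0, 3.0], [2.0, 4.0]]
    out = taa_recalibrate(
        F,
        W_gamma_down=[[1.0, 0.0]], W_gamma_up=[[0.5], [1.0]],
        W_beta_down=[[0.0, 1.0]], W_beta_up=[[0.0], [-1.0]],
    )
    print(out)
```

In this sketch only the four small projection matrices would be trained, while `F` comes from a frozen backbone, which matches the summary's description of TAA as learning few parameters to recalibrate intermediate features.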

TAMT: Temporal-Aware Model Tuning for Cross-Domain Few-Shot Action Recognition

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Understanding the Cross-Domain Capabilities of Video-Based Few-Shot Action Recognition Models

On the Importance of Spatial Relations for Few-shot Action Recognition

Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition

MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition

Task-Specific Alignment and Multiple Level Transformer for Few-Shot Action Recognition

Temporal Distinct Representation Learning for Action Recognition

Learning Causal Domain-Invariant Temporal Dynamics for Few-Shot Action Recognition

Motion-modulated Temporal Fragment Alignment Network for Few-Shot Action Recognition

DTCM: Joint Optimization of Dark Enhancement and Action Recognition in Videos

A Pairwise Attentive Adversarial Spatiotemporal Network for Cross-Domain Few-Shot Action Recognition-R2.

Task-Adapter: Task-specific Adaptation of Image Models for Few-shot Action Recognition

Task-Aware Dual-Representation Network for Few-Shot Action Recognition

DMSD-CDFSAR: Distillation from Mixed-Source Domain for Cross-Domain Few-shot Action Recognition

Task-adaptive Spatial-Temporal Video Sampler for Few-shot Action Recognition

Learnable Feature Augmentation Framework for Temporal Action Localization

Temporal Segment Networks for Action Recognition in Videos

TACDFSL: Task Adaptive Cross Domain Few-Shot Learning

Temporal-Spatial Mapping for Action Recognition