Abstract: Canonical video action recognition methods usually label categories with numbers or one-hot vectors and train neural networks to classify a fixed set of predefined categories, which constrains both their ability to recognise complex actions and their transferability to unseen concepts. In contrast, cross-modal learning can improve the performance of individual modalities. Based on the observation that a better action recogniser can be built by reading the statements used to describe actions, we exploited the recent multimodal foundation model CLIP for action recognition. In this study, an effective vision-language adaptation for action recognition was implemented based on few-shot examples spanning different modalities. We added semantic information to action categories by treating textual and visual labels as training examples for action classifier construction rather than simply labelling them with numbers. Because words in a text and frames in a video differ in importance, simply averaging all sequential features may cause keywords or key video frames to be ignored. To capture sequential and hierarchical representations, a weighted token-wise interaction mechanism was employed to exploit the pair-wise correlations adaptively. Extensive experiments on public datasets show that cross-modal learning helps downstream action image classification; in other words, the proposed method can train better action classifiers by reading the sentences that describe the actions themselves. The proposed method not only achieves good generalisation and zero-shot/few-shot transfer ability on out-of-distribution (OOD) test sets, but also has lower computational complexity owing to the lightweight interaction mechanism, while reaching 84.15% Top-1 accuracy on Kinetics-400.
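The contrast the abstract draws between simple averaging and weighted token-wise interaction can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that per-token text features and per-frame video features are already available (for example from CLIP encoders), and that token and frame importance is scored by hypothetical projection vectors `w_text` and `w_frame`, which in practice would be learned. The function names are likewise illustrative.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def mean_pooled_similarity(text_tokens, frame_feats):
    """Baseline: average all tokens and all frames, then take cosine similarity."""
    t = l2_normalize(text_tokens.mean(axis=0))
    v = l2_normalize(frame_feats.mean(axis=0))
    return float(t @ v)


def weighted_token_wise_similarity(text_tokens, frame_feats, w_text, w_frame):
    """Weighted token-wise interaction between one sentence and one video.

    text_tokens: (n_tokens, d) word-level features
    frame_feats: (n_frames, d) frame-level features
    w_text, w_frame: (d,) hypothetical importance projections (assumed learned)
    """
    t = l2_normalize(text_tokens)
    v = l2_normalize(frame_feats)
    sim = t @ v.T  # pair-wise cosine similarities, shape (n_tokens, n_frames)

    # Importance weights over tokens and over frames; each set sums to 1.
    token_weights = softmax(text_tokens @ w_text)
    frame_weights = softmax(frame_feats @ w_frame)

    # Each token is matched to its best frame, and each frame to its best token;
    # the matches are then combined according to the importance weights.
    text_to_video = np.sum(token_weights * sim.max(axis=1))
    video_to_text = np.sum(frame_weights * sim.max(axis=0))
    return float(0.5 * (text_to_video + video_to_text))


# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))   # 6 words, 16-dim features
frames = rng.normal(size=(8, 16))   # 8 frames, 16-dim features
w_t = rng.normal(size=16)
w_f = rng.normal(size=16)
score = weighted_token_wise_similarity(tokens, frames, w_t, w_f)
baseline = mean_pooled_similarity(tokens, frames)
```

Because the weights sum to one and every pair-wise similarity is a cosine, the resulting score stays within [-1, 1], so it can be compared directly with the mean-pooled baseline.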

Cross-modality Online Distillation for Multi-View Action Recognition

Multi-layer Representation for Cross-view Action Recognition

Cross-modal learning with multi-modal model for video action recognition based on adaptive weight training

View-invariant Human Action Recognition Via Robust Locally Adaptive Multi-View Learning

Multi-view Distillation based on Multi-modal Fusion for Few-shot Action Recognition (CLIP-$\mathrm{M^2}$DF)

Annealing Temporal-Spatial Contrastive Learning for Multi-View Online Action Detection

CMD: Self-supervised 3D Action Representation Learning with Cross-Modal Mutual Distillation

Modality Distillation with Multiple Stream Networks for Action Recognition

Learning and Distillating the Internal Relationship of Motion Features in Action Recognition.

DVANet: Disentangling View and Action Features for Multi-View Action Recognition

Multimodal Distillation for Egocentric Action Recognition

Interactive Learning of a Dual Convolution Neural Network for Multi-Modal Action Recognition

Learning an Augmented RGB Representation with Cross-Modal Knowledge Distillation for Action Detection

Multi-modal Instance Refinement for Cross-domain Action Recognition

Hierarchically Learned View-Invariant Representations for Cross-View Action Recognition

Multi-Task Learning of Generalizable Representations for Video Action Recognition

Cross-View Gait Recognition Method Based on Multi-Teacher Joint Knowledge Distillation

Discriminative virtual views for cross-view action recognition

Multi-view key information representation and multi-modal fusion for single-subject routine action recognition

Multi-Modality Multi-Task Recurrent Neural Network for Online Action Detection

Cross-view Action Modeling, Learning and Recognition