Abstract: The number of categories for action recognition is growing rapidly, and it has become increasingly hard to label sufficient training data for learning conventional models for all categories. Instead of collecting ever more data and labelling them exhaustively for all categories, an attractive alternative approach is "zero-shot learning" (ZSL). To that end, in this study we construct a mapping between visual features and a semantic descriptor of each action category, allowing new categories to be recognised in the absence of any visual training data. Existing ZSL studies focus primarily on still images and attribute-based semantic representations. In this work, we explore word-vectors as the shared semantic space to embed videos and category labels for ZSL action recognition. This is a more challenging problem than existing ZSL of still images and/or attributes, because the mapping between video space-time features of actions and the semantic space is more complex and harder to learn for the purpose of generalising over any cross-category domain shift. To solve this generalisation problem in ZSL action recognition, we investigate a series of synergistic strategies to improve upon the standard ZSL pipeline. Most of these strategies are transductive in nature, meaning that they assume access to unlabelled testing data during the training phase. First, we significantly enhance the semantic space mapping by proposing manifold-regularized regression and data augmentation strategies. Second, we evaluate two existing post-processing strategies (transductive self-training and hubness correction) and show that they are complementary. We extensively evaluate our model on a wide range of human action datasets, including HMDB51, UCF101 and Olympic Sports, and on event datasets, including CCV and TRECVID MED 13. The results demonstrate that our approach achieves state-of-the-art zero-shot action recognition performance with a simple and efficient pipeline, and without supervised annotation of attributes. Finally, we present an in-depth analysis of why and when zero-shot works, including demonstrating the ability to predict cross-category transferability in advance.
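The matching and self-training steps named in the abstract can be illustrated with a minimal sketch. It assumes the test videos have already been projected into the word-vector space (the regression stage is skipped), and it uses one common form of transductive self-training: each unseen-class prototype is replaced by the mean of its k nearest test embeddings. The 2-D vectors, the class names, the value of k and the helper names (`cosine`, `nearest_label`, `self_train`) are all hypothetical, chosen only for illustration; this is not claimed to be the authors' exact procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_label(z, prototypes):
    """Zero-shot prediction: the class whose prototype is most similar to z."""
    return max(prototypes, key=lambda name: cosine(z, prototypes[name]))

def self_train(prototypes, test_embeddings, k):
    """Move each prototype to the mean of its k nearest (unlabelled) test embeddings."""
    adapted = {}
    for name, proto in prototypes.items():
        ranked = sorted(test_embeddings, key=lambda z: cosine(z, proto), reverse=True)
        neighbours = ranked[:k]
        adapted[name] = [
            sum(z[d] for z in neighbours) / len(neighbours)
            for d in range(len(proto))
        ]
    return adapted

# Hypothetical word-vector prototypes for two unseen classes.
prototypes = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}

# Hypothetical projected test videos; the first three are "run", the last
# three are "jump". The "run" cluster is shifted away from its prototype,
# which mimics a cross-category domain shift.
test_embeddings = [
    [0.87, 0.50], [0.82, 0.57], [0.64, 0.77],
    [0.17, 0.98], [0.09, 1.00], [0.00, 1.00],
]

# Plain nearest-neighbour matching against the original prototypes.
before = [nearest_label(z, prototypes) for z in test_embeddings]

# The same matching after adapting the prototypes to the test distribution.
adapted = self_train(prototypes, test_embeddings, k=2)
after = [nearest_label(z, adapted) for z in test_embeddings]

print(before)
print(after)
```

In this toy setup the third video is matched to the wrong class before adaptation and to the correct one afterwards, because the adapted "run" prototype sits inside the shifted cluster. The step needs no labels for the test videos, only the embeddings themselves.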

Learning Temporal Information and Object Relation for Zero-Shot Action Recognition

Zero-Shot Detection with Transferable Object Proposal Mechanism

Context-Guided Super-Class Inference for Zero-Shot Detection

Revisiting the Spatial and Temporal Modeling for Few-shot Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

GAN for Vision, KG for Relation: a Two-stage Deep Network for Zero-shot Action Recognition

An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition

Exploring Semantic Inter-Class Relationships (SIR) for Zero-Shot Action Recognition

Transductive Zero-Shot Action Recognition by Word-Vector Embedding

I Know the Relationships: Zero-Shot Action Recognition Via Two-Stream Graph Convolutional Networks and Knowledge Graphs

Logic-guided Semantic Representation Learning for Zero-Shot Relation Classification

On the Importance of Spatial Relations for Few-shot Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Temporal Distinct Representation Learning for Action Recognition

Neuron: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition

Hierarchical Temporal Memory Enhanced One-Shot Distance Learning for Action Recognition

Zero-shot Skeleton-based Action Recognition via Mutual Information Estimation and Maximization

Semantic Embedding Space for Zero-Shot Action Recognition

Learning Latent Semantic Attributes for Zero-Shot Object Detection

Zero-Shot Skeleton-based Action Recognition with Dual Visual-Text Alignment

Zero-Shot Recognition Using Dual Visual-Semantic Mapping Paths