Abstract: The performance of action recognition in video sequences depends significantly on the representation of actions and on the similarity measurement between the representations. In this paper, we combine two kinds of features extracted from spatio-temporal interest points with context-aware kernels for action recognition. For action representation, local cuboid features extracted around interest points are commonly encoded with a Bag of Visual Words (BOVW) model. Such representations, however, ignore potentially valuable information about the global spatio-temporal distribution of interest points. We propose a new global feature to capture the detailed geometrical distribution of interest points. It is calculated using the 3D ℛ transform, which is defined as an extended 3D discrete Radon transform, followed by a two-directional two-dimensional principal component analysis. For similarity measurement, we model a video set as an optimized probabilistic hypergraph and propose a context-aware kernel to measure high-order relationships among videos. The context-aware kernel is more robust to noise and outliers in the data than the traditional context-free kernel, which considers only pairwise relationships between videos. The hyperedges of the hypergraph are constructed based on a learnt Mahalanobis distance metric, and interfering information from other classes is excluded from each hyperedge. Finally, a multiple kernel learning algorithm is designed by integrating l_2 norm regularization into a linear SVM classifier to fuse the ℛ feature and the BOVW representation for action recognition. Experimental results on several datasets demonstrate the effectiveness of the proposed approach for action recognition.
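
The abstract names two-directional two-dimensional principal component analysis as the step that compresses the ℛ transform output, but does not give its details. The following is a minimal sketch of the generic technique, not the authors' implementation. It assumes each video has already been summarized as an m x n matrix (a stand-in for the ℛ transform output, which is not computed here). The function name `two_directional_2dpca`, the parameters `d` and `q`, and the choice to project mean-centered matrices are illustrative assumptions.

```python
import numpy as np

def two_directional_2dpca(mats, d, q):
    """Generic two-directional 2D PCA sketch (hypothetical helper).

    mats: array of shape (N, m, n), one matrix per sample.
    d:    number of row-direction components to keep (d <= n).
    q:    number of column-direction components to keep (q <= m).
    Returns (mean, Z, X, feats), where feats has shape (N, q, d).
    """
    mats = np.asarray(mats, dtype=float)
    mean = mats.mean(axis=0)
    centered = mats - mean

    # Row-direction scatter: average of A^T A over samples, shape (n, n).
    g_row = np.einsum('kij,kil->jl', centered, centered) / len(mats)
    # Column-direction scatter: average of A A^T over samples, shape (m, m).
    g_col = np.einsum('kij,klj->il', centered, centered) / len(mats)

    # eigh returns eigenvalues in ascending order, so reverse the columns
    # to put the leading eigenvectors first.
    _, vec_row = np.linalg.eigh(g_row)
    _, vec_col = np.linalg.eigh(g_col)
    X = vec_row[:, ::-1][:, :d]   # (n, d) right projection
    Z = vec_col[:, ::-1][:, :q]   # (m, q) left projection

    # Project every centered matrix from both sides: Z^T A X.
    feats = np.einsum('iq,kij,jd->kqd', Z, centered, X)
    return mean, Z, X, feats

# Usage on random stand-in data: 20 samples of size 8 x 6.
rng = np.random.default_rng(0)
demo = rng.normal(size=(20, 8, 6))
mean, Z, X, feats = two_directional_2dpca(demo, d=3, q=4)
print(feats.shape)
```

Projecting from both sides reduces each m x n matrix to q x d values, which is why the two-directional variant yields a more compact feature than projecting along one direction only.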

Global and Local Discriminative Patches Exploiting for Action Recognition

Learning SpatioTemporal and Motion Features in a Unified 2D Network for Action Recognition

A Method of Simultaneously Action Recognition and Video Segmentation of Video Streams.

Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Human Action Recognition Based on Three-Stream Network with Frame Sequence Features

B2C-AFM: Bi-Directional Co-Temporal and Cross-Spatial Attention Fusion Model for Human Action Recognition.

Fusing ℛ Features and Local Features with Context-Aware Kernels for Action Recognition

Learning and Distillating the Internal Relationship of Motion Features in Action Recognition.

Multi-scale residual network model combined with Global Average Pooling for action recognition

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Discriminative Segment Focus Network for Fine-grained Video Action Recognition

Residual Frames with Efficient Pseudo-3D CNN for Human Action Recognition

Local Feature Analysis for real-time Action Recognition.

Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization

Skeleton-Indexed Deep Multi-Modal Feature Learning for High Performance Human Action Recognition

Human Action Recognition in Unconstrained Videos by Explicit Motion Modeling

Semi-supervised human action recognition via dual-stream cross-fusion and class-aware memory bank

Action Recognition from Depth Sequences Using Weighted Fusion of 2D and 3D Auto-Correlation of Gradients Features

3D Action Recognition Using Multi-Temporal Depth Motion Maps and Fisher Vector

Learning Discriminative Features for Fast Frame-Based Action Recognition.

Action-Stage Emphasized Spatiotemporal VLAD for Video Action Recognition