Abstract: The performance of action recognition in video sequences depends significantly on how actions are represented and on how similarity between the representations is measured. In this paper, we combine two kinds of features extracted from spatio-temporal interest points with context-aware kernels for action recognition. For the action representation, local cuboid features extracted around interest points are widely used within a Bag of Visual Words (BOVW) model. Such representations, however, ignore potentially valuable information about the global spatio-temporal distribution of interest points. We propose a new global feature to capture the detailed geometrical distribution of interest points. It is calculated using the 3D ℛ transform, which is defined as an extended 3D discrete Radon transform, followed by a two-directional two-dimensional principal component analysis. For the similarity measurement, we model a video set as an optimized probabilistic hypergraph and propose a context-aware kernel to measure high-order relationships among videos. The context-aware kernel is more robust to noise and outliers in the data than the traditional context-free kernel, which considers only pairwise relationships between videos. The hyperedges of the hypergraph are constructed based on a learnt Mahalanobis distance metric, and disturbing information from other classes is excluded from each hyperedge. Finally, a multiple kernel learning algorithm is designed by integrating ℓ2-norm regularization into a linear SVM classifier to fuse the ℛ feature and the BOVW representation for action recognition. Experimental results on several datasets demonstrate the effectiveness of the proposed approach.
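The two-directional two-dimensional PCA step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each video's ℛ transform output can be arranged as a 2D matrix of a fixed size. That arrangement is an assumption here, not something the abstract specifies. The function names `fit_2d2pca` and `project_2d2pca` and the parameters `q` and `d` are hypothetical, and the code follows the generic textbook form of the technique rather than the authors' implementation.

```python
import numpy as np


def fit_2d2pca(mats, q, d):
    """Learn left/right projections from a stack of matrices.

    mats: array of shape (N, m, n), one matrix per sample (assumed layout).
    Returns Z of shape (m, q) and X of shape (n, d).
    """
    mats = np.asarray(mats, dtype=float)
    centered = mats - mats.mean(axis=0)
    count = len(mats)
    # Row-direction scatter: average of A^T A over samples, shape (n, n).
    g_row = np.einsum("kij,kil->jl", centered, centered) / count
    # Column-direction scatter: average of A A^T over samples, shape (m, m).
    g_col = np.einsum("kij,klj->il", centered, centered) / count
    # eigh returns eigenvalues in ascending order, so reverse the columns
    # to keep the eigenvectors with the largest eigenvalues.
    _, vec_row = np.linalg.eigh(g_row)
    _, vec_col = np.linalg.eigh(g_col)
    X = vec_row[:, ::-1][:, :d]
    Z = vec_col[:, ::-1][:, :q]
    return Z, X


def project_2d2pca(mat, Z, X):
    """Compress one (m, n) matrix to a (q, d) feature matrix."""
    return Z.T @ np.asarray(mat, dtype=float) @ X


if __name__ == "__main__":
    # Random stand-in data; real inputs would be per-video feature matrices.
    rng = np.random.default_rng(0)
    samples = rng.normal(size=(20, 6, 5))
    Z, X = fit_2d2pca(samples, q=2, d=3)
    print(project_2d2pca(samples[0], Z, X).shape)
```

Projecting from both sides reduces rows and columns at once, so the resulting feature is much smaller than the input matrix while retaining the directions of largest variance in each dimension.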

Modeling Geometric-Temporal Context with Directional Pyramid Co-Occurrence for Action Recognition

Learning Visual Context for Group Activity Recognition

Temporal Distinct Representation Learning for Action Recognition

Beyond Spatial Pyramid Matching: Space-time Extended Descriptor for Action Recognition

Spatio-temporal Laplacian Pyramid Coding for Action Recognition

Human action recognition using pyramid vocabulary tree

Human Action Recognition under Log-Euclidean Riemannian Metric

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Spatio-Temporal Proximity Distribution Kernels for Action Recognition

Fusing ℛ Features and Local Features with Context-Aware Kernels for Action Recognition

Directional Temporal Modeling for Action Recognition

An Attentional Spatial Temporal Graph Convolutional Network with Co-Occurrence Feature Learning for Action Recognition

Collaborative Spatio-temporal Feature Learning for Video Action Recognition

Reassessing Hierarchical Representation for Action Recognition in Still Images

3D Action Recognition Using Multi-Temporal Depth Motion Maps and Fisher Vector

Robust 3D Action Recognition Through Sampling Local Appearances and Global Distributions

Learning Informative Pairwise Joints with Energy-Based Temporal Pyramid for 3D Action Recognition

Efficient Spatialtemporal Context Modeling for Action Recognition

Multi-Directional Convolution Networks with Spatial-Temporal Feature Pyramid Module for Action Recognition

Spatiotemporal Multimodal Learning With 3D CNNs for Video Action Recognition

Combining Sparse And Dense Descriptors With Temporal Semantic Structures For Robust Human Action Recognition