Abstract: Many existing methods formulate action prediction as recognizing the early parts of actions in trimmed videos. In this paper, we focus on predicting actions from ongoing untrimmed videos, where actions might not happen at the very beginning of a video. Predicting actions in such untrimmed videos is extremely challenging because the early parts of the videos carry ambiguous or even no information about the actions. To address this problem, we propose a prediction confidence that assesses the decision quality of a prediction model. Guided by this confidence, the model continuously refines its own prediction results as more video frames are observed. Specifically, we build a Self Prediction Refining Network (SPR-Net) that incrementally learns the confidence for action prediction. SPR-Net consists of three modules: a temporal hybrid network, an incremental confidence learner, and a self-refining Gumbel softmax sampler. The temporal hybrid network generates action category distributions by integrating static scene and dynamic motion information. The incremental confidence learner calculates the confidence incrementally, judging the extent to which the temporal hybrid network should trust its prediction result. The self-refining Gumbel softmax sampler models the mutual relationship between the prediction confidence and the category distribution, which enables the two to be jointly learned end to end. We also present a sparse self-attention mechanism that encodes local spatio-temporal features into the frame-level motion representation to further improve prediction performance. Extensive experiments on five datasets (UT-Interaction, BIT-Interaction, UCF101, THUMOS14, and ActivityNet) validate the effectiveness of the proposed method.
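
For readers unfamiliar with the Gumbel softmax relaxation that the sampler module is named after, a minimal sketch of the generic technique may help. This is not SPR-Net's sampler: the abstract does not describe how the confidence enters the sampling step, so the function name `gumbel_softmax`, the `temperature` parameter, and the example logits below are all illustrative assumptions.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=None):
    """Draw one relaxed (soft) one-hot sample over categories.

    Generic Gumbel softmax relaxation only; hypothetical helper,
    not the self-refining sampler described in the abstract.
    """
    rng = rng or random.Random()
    eps = 1e-12  # keeps the uniform draw strictly inside (0, 1)
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), eps), 1.0 - eps)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Example: three hypothetical action categories with made-up scores.
sample = gumbel_softmax([2.0, 0.5, 0.1], temperature=0.5, rng=random.Random(0))
```

Lower temperatures push the sample toward a one-hot vector, while higher temperatures keep it closer to uniform. Because the sample is a smooth function of the logits, gradients can pass through the sampling step, which is what makes end-to-end training of a model containing such a sampler possible.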

Action Prediction Via Deep Residual Feature Learning and Weighted Loss

Deep Residual Feature Learning for Action Prediction

Action Knowledge Transfer for Action Prediction with Partial Videos

Early Action Prediction with Generative Adversarial Networks

Rich Action-Semantic Consistent Knowledge for Early Action Prediction

Frame-part-activated deep reinforcement learning for Action Prediction

A Discussion of Data Sampling Strategies for Early Action Prediction

DBDNet: Learning Bi-directional Dynamics for Early Action Prediction

A New Depth Residual Network Combined Recurrent with Residual Structure for Human Action Recognition from Videos

Confidence-Guided Self Refinement for Action Prediction in Untrimmed Videos

Action Recognition By Learning Deep Multi-Granular Spatio-Temporal Video Representation

Progressive Teacher-Student Learning For Early Action Prediction

Egocentric Early Action Prediction via Adversarial Knowledge Distillation

End-to-end Video-level Representation Learning for Action Recognition

Collaborative Spatio-temporal Feature Learning for Video Action Recognition

Learning Hierarchical Video Representation for Action Recognition

Spatial–Temporal Context-Aware Online Action Detection and Prediction

Unsupervised Deep Learning of Mid-Level Video Representation for Action Recognition

Interpretable Deep Feature Propagation for Early Action Recognition

Deep Point-Wise Prediction for Action Temporal Proposal