Confidence-Guided Self Refinement for Action Prediction in Untrimmed Videos

Jingyi Hou, Xinxiao Wu, Ruiqi Wang, Jiebo Luo, Yunde Jia
DOI: https://doi.org/10.1109/tip.2020.2987425
IF: 10.6
2020-01-01
IEEE Transactions on Image Processing
Abstract: Many existing methods formulate action prediction as recognizing the early parts of actions in trimmed videos. In this paper, we focus on predicting actions from ongoing untrimmed videos, where actions might not happen at the very beginning of the video. Predicting actions in such untrimmed videos is extremely challenging because the early parts of the videos carry ambiguous or even no information about the actions. To address this problem, we propose a prediction confidence that assesses the decision quality of a prediction model. Guided by this confidence, the model continuously refines its own prediction results as more video frames are observed. Specifically, we build a Self Prediction Refining Network (SPR-Net), which incrementally learns the confidence for action prediction. SPR-Net consists of three modules: a temporal hybrid network, an incremental confidence learner, and a self-refining Gumbel softmax sampler. The temporal hybrid network generates the action category distributions by integrating static scene and dynamic motion information. The incremental confidence learner calculates the confidence incrementally, judging the extent to which the temporal hybrid network should trust its prediction result. The self-refining Gumbel softmax sampler models the mutual relationship between the prediction confidence and the category distribution, which enables the two to be learned jointly in an end-to-end fashion. We also present a sparse self-attention mechanism that encodes local spatio-temporal features into the frame-level motion representation to further improve prediction performance. Extensive experiments on five datasets (UT-Interaction, BIT-Interaction, UCF101, THUMOS14, and ActivityNet) validate the effectiveness of the proposed method.
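The abstract does not specify how the self-refining Gumbel softmax sampler is built, so the following is only a minimal sketch of the generic Gumbel-softmax relaxation that such a sampler presumably builds on, not SPR-Net's actual module. The function name `gumbel_softmax_sample`, the `temperature` parameter, and the example logits are hypothetical and chosen for illustration.

```python
import numpy as np


def gumbel_softmax_sample(logits, temperature=1.0, rng=None):
    """Draw a relaxed one-hot sample over categories (generic Gumbel-softmax).

    Hypothetical helper for illustration; not the paper's self-refining sampler.
    """
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float)

    # Gumbel(0, 1) noise via inverse transform sampling; clip to avoid log(0).
    u = np.clip(rng.uniform(size=logits.shape), 1e-12, 1.0 - 1e-12)
    gumbel = -np.log(-np.log(u))

    # Temperature-scaled softmax over the last axis, shifted for stability.
    y = (logits + gumbel) / temperature
    y = y - y.max(axis=-1, keepdims=True)
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)


# Usage: relaxed sample from an assumed 3-class action distribution.
probs = np.array([0.7, 0.2, 0.1])
sample = gumbel_softmax_sample(np.log(probs), temperature=0.5,
                               rng=np.random.default_rng(0))
```

Each sample is a probability vector that sharpens toward one-hot as `temperature` decreases, and the argmax of a sample follows the underlying category distribution. Because the sampling step stays differentiable with respect to the logits, a relaxation of this kind allows a category distribution to be trained end to end alongside other quantities, such as a confidence score.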