CTAP: Complementary Temporal Action Proposal Generation

Jiyang Gao, Kan Chen, Ram Nevatia
DOI: https://doi.org/10.48550/arXiv.1807.04821
2018-07-19
Abstract: Temporal action proposal generation is an important task: akin to object proposals, temporal action proposals are intended to capture "clips" or temporal intervals in videos that are likely to contain an action. Previous methods can be divided into two groups: sliding window ranking and actionness score grouping. Sliding windows uniformly cover all segments in videos, but their temporal boundaries are imprecise; grouping-based methods may have more precise boundaries, but they may omit some proposals when the quality of the actionness score is low. Based on the complementary characteristics of these two methods, we propose a novel Complementary Temporal Action Proposal (CTAP) generator. Specifically, we apply a Proposal-level Actionness Trustworthiness Estimator (PATE) on the sliding window proposals to generate probabilities indicating whether the actions can be correctly detected by actionness scores; the windows with high scores are collected. The collected sliding windows and actionness proposals are then processed by a temporal convolutional neural network for proposal ranking and boundary adjustment. CTAP outperforms state-of-the-art methods on average recall (AR) by a large margin on the THUMOS-14 and ActivityNet 1.3 datasets. We further apply CTAP as a proposal generation method in an existing action detector and show consistent, significant improvements.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the generation of high-quality temporal action proposals in videos. Temporal action proposals aim to capture "segments", or time intervals, in videos that are likely to contain an action, and the authors focus on making these proposals more accurate. Existing methods fall mainly into two categories: sliding window ranking and actionness score grouping. Sliding windows cover all parts of the video evenly, but their temporal boundaries are imprecise. Methods based on actionness scores may produce more precise boundaries, but they can miss proposals when the quality of the actionness scores is low. The paper therefore proposes a new Complementary Temporal Action Proposal (CTAP) generator that combines the advantages of both approaches.

CTAP consists of three modules:

1. **Initial Proposal Generation**: Initial proposals are generated from two sources: actionness scores grouped by Temporal Action Grouping (TAG), and sliding windows.
2. **Proposal Complementary Filtering**: Because TAG misses correct proposals when the actionness score quality is low, while sliding windows cover all parts of the video evenly, a complementary filter collects high-quality sliding windows to fill in the proposals that TAG misses.
3. **Proposal Ranking and Boundary Adjustment**: A temporal convolutional neural network ranks the proposals and adjusts their temporal boundaries, so that the order information of the proposal boundaries is retained.

The paper reports experiments on the THUMOS-14 and ActivityNet v1.3 datasets. The results show that CTAP significantly outperforms existing methods in terms of Average Recall (AR) and also yields consistent improvements when used in the action detection task.
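To make the complementary idea concrete, here is a minimal Python sketch of how sliding windows could be merged with grouping-based (TAG) proposals. It is an illustration under stated assumptions, not the authors' implementation. The function names (`sliding_windows`, `temporal_iou`, `complementary_filter`, `toy_score`), the window lengths, the overlap ratio, and the threshold are all hypothetical. In particular, `toy_score` is a hand-written stand-in for PATE, which in the paper is a learned estimator. The sketch follows the abstract's wording that windows with high scores are collected.

```python
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def sliding_windows(duration: float, lengths: Sequence[float],
                    overlap: float = 0.5) -> List[Interval]:
    """Uniformly cover [0, duration] with windows of each given length."""
    windows: List[Interval] = []
    for length in lengths:
        stride = length * (1.0 - overlap)
        start = 0.0
        while start + length <= duration:
            windows.append((start, start + length))
            start += stride
    return windows


def temporal_iou(a: Interval, b: Interval) -> float:
    """Intersection over union of two temporal intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def complementary_filter(windows: Sequence[Interval],
                         tag_proposals: Sequence[Interval],
                         score_fn: Callable[[Interval], float],
                         threshold: float = 0.5) -> List[Interval]:
    """Keep all TAG proposals and add the windows whose score passes the threshold.

    score_fn is an assumed placeholder for a PATE-like estimator.
    """
    collected = [w for w in windows if score_fn(w) >= threshold]
    return list(tag_proposals) + collected


# Hypothetical usage: one TAG proposal and 4-second windows over a 10-second video.
tag = [(0.0, 4.0)]
windows = sliding_windows(10.0, lengths=[4.0])


def toy_score(w: Interval) -> float:
    # Assumption for illustration only: score a window highly when no TAG
    # proposal overlaps it well. PATE itself is learned from data.
    return 1.0 - max(temporal_iou(w, p) for p in tag)


candidates = complementary_filter(windows, tag, toy_score, threshold=0.5)
print(candidates)
```

The merged candidate list would then go to the ranking and boundary adjustment stage, which the paper implements with a temporal convolutional network and which is not modeled here.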