Temporal Action Proposal Generation Via Multi-Task Feature Learning.

Handong Ma, Lixin Duan
DOI: https://doi.org/10.1109/vcip49819.2020.9301824
2020-01-01
Abstract: Temporal action proposal generation is an active topic in the computer vision and image/video processing communities. It aims to predict a set of temporal proposals that discover all action instances in real-world untrimmed videos with high recall and high intersection over union (IoU). Many previous approaches rely on a two-stream network, which simultaneously extracts spatial (appearance) and temporal features to represent each video. However, few works attempt to capture the correlations between the two kinds of features, leaving considerable room for improving the model. In this paper, we present a new method for generating temporal action proposals based on multi-task feature learning. Specifically, we learn a shared representation between the spatial and temporal features in a multi-task learning framework, so as to acquire a compact and precise feature representation. Moreover, we devise a correlation loss to address the 'weak-correlation' problem, in which proposals have high IoU but low confidence scores. Finally, we adopt an ensemble learning strategy in order to inherit the advantages of existing works. Extensive experimental results on the ActivityNet-1.3 challenge dataset show that the proposed method achieves the best performance compared with the state-of-the-art methods reported in the literature and on the official leaderboard. Our code will be released soon.
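To make the IoU criterion mentioned in the abstract concrete, the following is a minimal sketch of the standard temporal IoU between a proposal and a ground-truth action instance. It assumes each segment is a (start, end) pair in seconds; the function name temporal_iou is hypothetical and is not taken from the authors' code.

```python
def temporal_iou(proposal, ground_truth):
    """Temporal IoU of two segments, each given as (start, end) in seconds.

    Illustrative helper only; the name and the (start, end) convention
    are assumptions, not part of the paper.
    """
    p_start, p_end = proposal
    g_start, g_end = ground_truth

    # Length of the overlap, clipped at zero for disjoint segments.
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))

    # Union = sum of the two lengths minus the overlap counted twice.
    union = (p_end - p_start) + (g_end - g_start) - intersection

    # Guard against degenerate zero-length segments.
    return intersection / union if union > 0 else 0.0
```

For example, a proposal covering 0 s to 10 s and an action covering 5 s to 15 s overlap for 5 s out of a 15 s union, giving an IoU of one third. A proposal with a high value here but a low predicted confidence score would be an instance of the 'weak-correlation' problem the abstract describes.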