Detecting Temporal Proposal for Action Localization with Tree-structured Search Policy

Xinyang Jiang, Siliang Tang, Yang, Zhou Zhao, Yin Zhang, Fei Wu, Yueting Zhuang
DOI: https://doi.org/10.1145/3123266.3123362
2017-01-01
Abstract: Understanding the semantics of videos is a complex but crucial task in video analysis. This paper focuses on localizing category-independent events, actions, or other semantics in an untrimmed video, a task referred to as salient temporal proposal localization. Traditional methods such as sliding windows have a high computational cost because they densely sample video segments. We propose a reinforcement-learning-based method that trains a localizer to learn a search policy. Instead of exploring every video segment, the policy finds an optimal search path through a tree structure to locate a salient proposal based on the currently observed video segment, thereby reducing the number of video segments fed into the proposal detector. In each search step, a localizer is trained to iteratively select the next sub-region containing salient proposals to continue the search, and a proposal detector is trained to recognize salient proposals from the sub-regions. The experiments demonstrate that our method can precisely detect salient proposals with comparable recall and far fewer candidate windows.
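The search procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three-way split into left, centre, and right halves, the per-frame saliency list, and the hand-written `mean_saliency` and `toy_detector` functions are all assumptions made for illustration, standing in for the learned localizer policy and the trained proposal detector. The sketch only shows why a greedy descent through a tree of segments calls the detector far less often than dense sliding windows.

```python
from typing import Callable, List, Optional, Tuple

Segment = Tuple[int, int]  # half-open frame interval [start, end)


def split(segment: Segment) -> List[Segment]:
    """Child nodes of a segment: left half, centre half, right half.

    The branching scheme is an assumption; the abstract only says "tree structure".
    """
    start, end = segment
    length = end - start
    half, quarter = length // 2, length // 4
    return [
        (start, start + half),
        (start + quarter, start + quarter + half),
        (start + half, end),
    ]


def tree_search(
    num_frames: int,
    localizer: Callable[[Segment], float],
    detector: Callable[[Segment], bool],
    min_length: int = 2,
) -> Tuple[Optional[Segment], int]:
    """Greedy root-to-leaf descent. Returns (proposal or None, detector calls)."""
    segment: Segment = (0, num_frames)
    detector_calls = 0
    while True:
        detector_calls += 1
        if detector(segment):
            return segment, detector_calls
        # Stop once the children would be shorter than the minimum length.
        if segment[1] - segment[0] < 2 * max(min_length, 1):
            return None, detector_calls
        # The localizer picks the child most likely to contain a proposal.
        segment = max(split(segment), key=localizer)


def sliding_window_count(num_frames: int, window_lengths: List[int], stride: int) -> int:
    """Number of candidate windows a dense sliding-window scan would evaluate."""
    return sum(
        len(range(0, num_frames - w + 1, stride))
        for w in window_lengths
        if w <= num_frames
    )


# Toy data (hypothetical): 64 frames with one salient action in frames [40, 48).
SALIENCY = [1.0 if 40 <= i < 48 else 0.0 for i in range(64)]


def mean_saliency(segment: Segment) -> float:
    """Stand-in for the learned localizer: average per-frame saliency."""
    start, end = segment
    return sum(SALIENCY[start:end]) / (end - start)


def toy_detector(segment: Segment) -> bool:
    """Stand-in for the proposal detector: mostly salient frames."""
    return mean_saliency(segment) >= 0.75


proposal, calls = tree_search(len(SALIENCY), mean_saliency, toy_detector)
print("tree search:", proposal, "detector calls:", calls)
print("sliding windows:", sliding_window_count(len(SALIENCY), [8, 16, 32], stride=4))
```

On this toy input the descent reaches the segment (40, 48) after 4 detector calls, whereas the sliding-window configuration shown evaluates 37 windows. These figures come from the toy setup only and say nothing about the results reported in the paper.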