DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization.

Chujie Lu,Long Chen,Chilie Tan,Xiaolin Li,Jun Xiao
DOI: https://doi.org/10.18653/v1/d19-1518
2019-01-01
Abstract:In this paper, we focus on natural language video localization: localizing (ie, grounding) a natural language description in a long and untrimmed video sequence. All currently published models for addressing this problem can be categorized into two types: (i) top-down approach: it does classification and regression for a set of pre-cut video segment candidates; (ii) bottom-up approach: it directly predicts probabilities for each video frame as the temporal boundaries (ie, start and end time point). However, both two approaches suffer several limitations: the former is computation-intensive for densely placed candidates, while the latter has trailed the performance of the top-down counterpart thus far. To this end, we propose a novel dense bottom-up framework: DEnse Bottom-Up Grounding (DEBUG). DEBUG regards all frames falling in the ground truth segment as foreground, and each foreground frame regresses the unique distances from its location to bi-directional ground truth boundaries. Extensive experiments on three challenging benchmarks (TACoS, Charades-STA, and ActivityNet Captions) show that DEBUG is able to match the speed of bottom-up models while surpassing the performance of the state-of-the-art top-down models.
What problem does this paper attempt to address?