Abstract: This study addresses the problem of localizing a moment in a long, untrimmed video given a natural language query, a task known as temporal moment localization using natural language (TMLNL). Existing research correlates the query with either a video frame or a moment fragment, neglecting the specific composition of the moment. In this work, we address the overlapping of similar moments in TMLNL using the start and end boundaries of the activity, and we introduce referencing expressions into TMLNL for the first time, which helps tie the visual expression uniquely to the textual expression. We present a novel method, stop overlapping in temporal moment localization using natural language (SOL-TMLNL), which resolves similar moment overlapping in videos by combining boundary-level word interactions with moment context features. For cross-modal relations, the sentence-level representation interacts with the frame-wise visual features, whereas the word-level representation interacts with the moment boundary features. We use referencing expressions in TMLNL to improve object detection on the basis of subject, position, and relationship. The proposed method outperforms current state-of-the-art methods on three benchmarks (Charades-STA, ActivityNet-Captions, and TACoS).

Moment Overlapping in Temporal Moment Localization in Videos Using Natural Language