Abstract: Given a natural language description, temporal textual localization aims to localize the most relevant segment in an untrimmed video, which is a natural and important extension of temporal action localization. Most existing temporal textual localization works neglect long-range semantic modeling of video content and lack accurate textual understanding. Moreover, they rely on single-task learning and fail to exploit multi-view supervised information. Based on these observations, we introduce a novel adversarial bi-directional interaction network, a global framework that retrieves the target segment directly. Specifically, we propose a bi-directional attention mechanism to build bi-directional information interaction, which captures long-range semantic dependencies from the video context and enhances textual representation learning. After localization, we further devise an auxiliary discriminator network to verify the localization result and boost performance through an adversarial training process. We adopt a multi-task learning approach to train our model, comprising: (1) a coordinate probability distribution prediction task, which selects the start and end frames to localize the target segment; (2) a frame-level correlation distribution prediction task, which calculates the correlation between each frame and the description; and (3) an auxiliary adversarial learning task, which calculates a matching score between the localized segment and the description. Extensive experiments on ActivityNet Captions and TACoS demonstrate the effectiveness and efficiency of our method.
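
Task (1) above selects a start and an end frame from predicted coordinate probability distributions. The abstract does not say how the two distributions are decoded, so the following is only a minimal sketch under an assumed rule: choose the pair of frames that maximizes the product of start and end probabilities, subject to the start not coming after the end. The function name `select_segment` and the toy probabilities are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def select_segment(p_start: List[float], p_end: List[float]) -> Tuple[int, int]:
    """Return (start, end) maximizing p_start[s] * p_end[e] with s <= e.

    Assumed decoding rule for illustration only; the paper may decode
    its coordinate distributions differently.
    """
    if not p_start or len(p_start) != len(p_end):
        raise ValueError("p_start and p_end must be non-empty and equally long")

    best_pair = (0, 0)
    best_score = -1.0
    best_s = 0  # index of the largest start probability among frames 0..e

    # Single left-to-right scan: for each candidate end frame e, the best
    # admissible start is the most probable start frame seen so far.
    for e in range(len(p_end)):
        if p_start[e] > p_start[best_s]:
            best_s = e
        score = p_start[best_s] * p_end[e]
        if score > best_score:
            best_score = score
            best_pair = (best_s, e)
    return best_pair


# Toy distributions over four frames (made-up numbers).
p_start = [0.1, 0.6, 0.2, 0.1]
p_end = [0.5, 0.1, 0.1, 0.3]
segment = select_segment(p_start, p_end)
```

Taking the two argmax positions independently would give start frame 1 and end frame 0 for these toy inputs, which is not a valid segment; the constrained scan avoids that case in linear time.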

Video text detection and localization based on localized generalization error model

A new video text detection method.

A Novel Approach to Text Detection and Extraction from Videos by Discriminative Features and Density

A method for text line detection in natural images

Text Detection Through Multiple-Scale Localization in Video Sequences

Video text detection and segmentation for optical character recognition

Discrete Wavelet Transform and Gradient Difference based approach for text localization in videos

Automatic video superimposed text detection based on Nonsubsampled Contourlet Transform

A Deep Convolutional Deblurring And Detection Neural Network For Localizing Text In Videos

Graphics and Scene Text Classification in Video

Detecting both superimposed and scene text with multiple languages and multiple alignments in video

A New Technique for Multi-Oriented Scene Text Line Detection and Tracking in Video

Video Text Localization with an emphasis on Edge Features

Video Scene Text Frames Categorization for Text Detection and Recognition

A Research on Video Text Tracking and Recognition

A New Deep Wavefront based Model for Text Localization in 3D Video

Temporal Textual Localization in Video Via Adversarial Bi-Directional Interaction Networks

Automatic Detection and Verification of Text Regions in News Video Frames

A Multi-stage Method for Chinese Text Detection in News Videos

Text Detection Using Delaunay Triangulation in Video Sequence

Intelligent Detection Method of English Text in Natural Scenes in Video