Video Text Detection by Attentive Spatiotemporal Fusion of Deep Convolutional Features

Lan Wang, Jiahao Shi, Yang Wang, Feng Su
DOI: https://doi.org/10.1145/3343031.3350868
2019-01-01
Abstract: Scene text in videos carries rich semantic information and plays an important role in various content-based video applications. Compared to text in static images, scene text in videos exhibits distinct characteristics such as motion blur and temporal redundancy, which bring additional difficulties as well as exploitable clues to the text detection task. In this paper, we propose a novel end-to-end deep neural network for detecting scene text in video, which combines complementary text features from multiple related frames to improve detection performance over single-frame detection schemes. Specifically, we first extract descriptive features from each video frame using a hierarchical convolutional neural network. Next, we spatiotemporally sample and warp supplementary features from the adjacent frames surrounding the current frame using a multi-scale deformable convolution structure. We then aggregate the sampled features with an attention mechanism that adaptively focuses on and augments relevant features, generating an enhanced feature representation of the current frame, which is fed to the prediction network to localize text candidates. The proposed model achieves state-of-the-art text detection performance on public scene text video datasets, demonstrating the superiority of the proposed multi-frame feature fusion scheme over most single-frame and tracking-based detection schemes.
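The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' network: it assumes the neighboring frames' features have already been warped into alignment with the current frame (the deformable-convolution sampling is omitted), and it substitutes a simple per-position cosine-similarity softmax for the attention mechanism, whose exact form the abstract does not specify. The names `attentive_fuse`, `cosine`, and `softmax` are hypothetical helpers introduced only for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_fuse(current, neighbors):
    """Fuse per-position features of the current frame with aligned neighbors.

    current:   list of feature vectors, one per spatial position.
    neighbors: list of frames, each shaped like `current` and assumed to be
               already warped to the current frame (an assumption of this sketch).
    Returns a fused list of feature vectors with the same shape as `current`.
    """
    frames = [current] + list(neighbors)
    fused = []
    for pos, cur_vec in enumerate(current):
        candidates = [frame[pos] for frame in frames]
        # Assumed attention: similarity to the current frame, normalized over frames.
        weights = softmax([cosine(cur_vec, cand) for cand in candidates])
        fused.append([
            sum(w * cand[k] for w, cand in zip(weights, candidates))
            for k in range(len(cur_vec))
        ])
    return fused
```

For example, `attentive_fuse([[1.0, 0.0]], [[[0.0, 1.0]]])` weights the current frame's own feature more heavily than the dissimilar neighbor, so the first component of the fused vector stays dominant. In a real detector the same weighting would be applied to dense convolutional feature maps rather than short Python lists.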