End-to-end video text detection with online tracking
Hongyuan Yu, Yan Huang, Lihong Pi, Chengquan Zhang, Xuan Li, Liang Wang
DOI: https://doi.org/10.1016/j.patcog.2020.107791
IF: 8
2021-05-01
Pattern Recognition
Abstract: Text in videos often provides important semantic cues that aid video analysis. Video text detection is considered one of the most difficult tasks in document analysis because of two challenges: 1) difficulties caused by video scenes, namely motion blur, illumination changes, and occlusion; 2) properties of the text itself, including variations in font, language, orientation, and shape. Most existing methods try to improve video text detection through video text tracking but treat the two tasks separately. This significantly increases the amount of computation and cannot take full advantage of the supervisory information of both tasks. In this work, we introduce an explainable descriptor, which combines appearance, geometry, and PHOC features, to bridge detection and tracking, and we build an end-to-end video text detection model with online tracking to address these challenges together. Integrating the two branches into one trainable framework lets them promote each other and significantly reduces the computational cost. In addition, the explainable descriptor gives our end-to-end model inherent interpretability. Experiments on existing video text benchmarks, including ICDAR 2013 Video, DOST, Minetto, and YVT, verify the role of the explainable descriptor in improving the model's expressive ability and show that the proposed method significantly outperforms state-of-the-art methods. Our method improves F-score by more than 2% on all datasets and achieves a MOTA of 81.52% on the Minetto dataset.
computer science, artificial intelligence; engineering, electrical & electronic