Towards Robust Video Text Detection with Spatio-Temporal Attention Modeling and Text Cues Fusion

Long Chen, Feng Su
DOI: https://doi.org/10.1109/ICME52920.2022.9859582
2022-01-01
Abstract: Information carried by video text is of great value to various video applications. However, detecting text in videos often faces great challenges due to the widely varied appearance of text and the complicated, dynamic video context. In this paper, we propose a robust video text detection network that adaptively combines relevant text cues in multiple frames with spatio-temporal attention and fusion mechanisms, which effectively enhance the accuracy and robustness of video text detection compared to single-frame detection. The network first localizes text region proposals and propagates them across frames with an R-CNN based framework. Then, a Transformer-based cross-frame feature fusion model is employed to attentively select and combine relevant text features, yielding an enhanced representation of each text region that integrates complementary text cues for robust text candidate prediction. The network achieves competitive text detection performance on standard video text benchmarks, demonstrating the effectiveness of the proposed method.
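The cross-frame fusion step described in the abstract can be illustrated with a minimal sketch of attention-weighted feature fusion. This is not the authors' model: it assumes a single attention head with no learned projections, and the names `fuse_across_frames`, `current`, and `frames` are hypothetical, as are the toy feature values. It only shows how features of one text region, propagated across frames, might be weighted by their similarity to the current-frame feature and combined.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_across_frames(query, frame_features):
    """Attention-weighted fusion of one text region's features over frames.

    query: feature vector of the region in the current frame.
    frame_features: feature vectors of the same region, one per frame.
    Assumption: plain scaled dot-product attention, no learned weights.
    """
    d = len(query)
    # Similarity of the current-frame feature to each frame's feature.
    scores = [
        sum(q * k for q, k in zip(query, feat)) / math.sqrt(d)
        for feat in frame_features
    ]
    weights = softmax(scores)
    # Weighted sum of per-frame features gives the fused representation.
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, frame_features))
        for i in range(d)
    ]
    return fused, weights


# Toy example (made-up values): the third frame is dissimilar,
# e.g. an occluded or blurred view of the region.
current = [1.0, 0.0, 1.0, 0.0]
frames = [
    [1.0, 0.0, 1.0, 0.0],
    [0.9, 0.1, 0.8, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
fused, weights = fuse_across_frames(current, frames)
```

In this sketch, frames whose features resemble the current-frame feature receive larger weights, so a dissimilar frame contributes less to the fused representation. A full Transformer-based model would add learned query, key, and value projections, multiple heads, and feed-forward layers.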