Incorporating Self-attention Mechanism and Multi-task Learning into Scene Text Detection

Ning Ding, Liangrui Peng, Changsong Liu, Yuqi Zhang, Ruixue Zhang, Jie Li
DOI: https://doi.org/10.1007/978-3-031-25069-9_21
2023-01-01
Abstract: In recent years, Mask R-CNN based methods have achieved promising performance on scene text detection tasks. This paper proposes incorporating a self-attention mechanism and multi-task learning into Mask R-CNN based scene text detection frameworks. For the backbone, the self-attention-based Swin Transformer replaces the original ResNet backbone, and a composite network scheme further combines two Swin Transformer networks into a single backbone. For the detection heads, a multi-task learning method is proposed that uses a cascade refinement structure for text/non-text classification, bounding box regression, mask prediction, and text line recognition. Experiments on the ICDAR MLT 2017 and 2019 datasets show that the proposed method achieves improved performance.
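As a rough illustration of how a multi-task objective over a cascade of detection heads might be combined, the sketch below sums the four per-task losses named in the abstract across refinement stages. The abstract does not give the loss formulation, so everything here is an assumption: the names `StageLosses` and `multitask_loss`, the weighted-sum form, and the default weights of 1.0 are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence

# The four tasks listed in the abstract; the short names are hypothetical.
TASKS = ("cls", "bbox", "mask", "rec")


@dataclass
class StageLosses:
    """Per-task loss values produced by one cascade stage (illustrative only)."""
    cls: float   # text/non-text classification
    bbox: float  # bounding box regression
    mask: float  # mask prediction
    rec: float   # text line recognition


def multitask_loss(
    stages: Sequence[StageLosses],
    task_weights: Optional[Dict[str, float]] = None,
    stage_weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of task losses over all cascade stages.

    Assumption: a plain weighted sum. The paper may weight or combine
    the terms differently.
    """
    if task_weights is None:
        task_weights = {name: 1.0 for name in TASKS}
    if stage_weights is None:
        stage_weights = [1.0] * len(stages)
    if len(stage_weights) != len(stages):
        raise ValueError("stage_weights must have one entry per stage")

    total = 0.0
    for stage, stage_weight in zip(stages, stage_weights):
        # Combine the four task losses of this stage.
        stage_total = sum(task_weights[name] * getattr(stage, name) for name in TASKS)
        total += stage_weight * stage_total
    return total
```

In this sketch, setting a task weight to 0.0 switches that task off, which is one simple way to compare training with and without the recognition branch.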