A Multi-Object Rectified Attention Network for Scene Text Recognition

Canjie Luo, Lianwen Jin, Zenghui Sun
DOI: https://doi.org/10.48550/arXiv.1901.03003
2019-01-10
Abstract: Irregular text is widely used. However, it is considerably difficult to recognize because of its various shapes and distorted patterns. In this paper, we thus propose a multi-object rectified attention network (MORAN) for general scene text recognition. The MORAN consists of a multi-object rectification network and an attention-based sequence recognition network. The multi-object rectification network is designed for rectifying images that contain irregular text. It decreases the difficulty of recognition and enables the attention-based sequence recognition network to more easily read irregular text. It is trained in a weak supervision way, thus requiring only images and corresponding text labels. The attention-based sequence recognition network focuses on target characters and sequentially outputs the predictions. Moreover, to improve the sensitivity of the attention-based sequence recognition network, a fractional pickup method is proposed for an attention-based decoder in the training phase. With the rectification mechanism, the MORAN can read both regular and irregular scene text. Extensive experiments on various benchmarks are conducted, which show that the MORAN achieves state-of-the-art performance. The source code is available.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the difficulty of accurately recognizing irregular text in scene images, whose shapes vary and whose patterns are often distorted. It proposes a Multi-Object Rectified Attention Network (MORAN) for general scene text recognition. MORAN consists of a Multi-Object Rectification Network (MORN) and an Attention-based Sequence Recognition Network (ASRN). MORN rectifies images that contain irregular text, which reduces the recognition difficulty, while ASRN focuses on the target characters and outputs its predictions sequentially. To improve the sensitivity of ASRN, the paper also proposes a fractional pickup method that is applied to the attention-based decoder during the training phase. With these mechanisms, MORAN can read both regular and irregular scene text and achieves state-of-the-art performance on multiple benchmarks.
The main contributions of the paper are:
1. The MORAN framework for recognizing irregular scene text, made up of MORN and ASRN. Images rectified by MORN are easier for ASRN to recognize.
2. MORN is trained in a weakly supervised manner that requires only images and their text labels. It is flexible, is not restricted by geometric constraints, and can rectify images with complex deformations.
3. A fractional pickup method for training the attention-based decoder in ASRN, which improves robustness to context changes.
4. A curriculum learning strategy that enables MORAN to learn efficiently. Trained with this strategy, MORAN surpasses existing methods on multiple standard text recognition benchmarks, including the IIIT5K, SVT, ICDAR2003, ICDAR2013, ICDAR2015, SVT-Perspective and CUTE80 datasets.
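The fractional pickup method named in the abstract is not spelled out on this page. As a rough illustration only, the sketch below assumes that it perturbs the attention weights of one decoding step during training by mixing a randomly chosen pair of neighboring weights with a random coefficient. The function name `fractional_pickup`, the choice of a single pair per step, and the uniform sampling of the coefficient are all assumptions made for this example and are not taken from the released source code.

```python
import random


def fractional_pickup(weights, rng=random):
    """Mix one randomly chosen pair of adjacent attention weights.

    weights: attention weights for a single decoding step.
    rng: any object with randrange() and random(), e.g. random.Random(seed).
    Returns a new list; the input list is left unchanged.
    NOTE: hypothetical sketch, not the authors' implementation.
    """
    if len(weights) < 2:
        return list(weights)  # nothing to mix

    k = rng.randrange(len(weights) - 1)  # left index of the adjacent pair
    beta = rng.random()                  # mixing coefficient in [0, 1)

    mixed = list(weights)
    a, b = weights[k], weights[k + 1]
    # Each of the two weights takes a fraction of its neighbor;
    # their sum (and so the total attention mass) is unchanged.
    mixed[k] = beta * a + (1 - beta) * b
    mixed[k + 1] = (1 - beta) * a + beta * b
    return mixed


if __name__ == "__main__":
    step_weights = [0.1, 0.6, 0.2, 0.1]
    print(fractional_pickup(step_weights, random.Random(0)))
```

Under these assumptions the perturbation would be applied only while training. Because the two mixed weights keep the same sum, the attention distribution stays normalized, while the decoder is exposed to features from a neighboring position.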