S5TR: Simple Single Stage Sequencer for Scene Text Recognition.

Zhijian Wu, Jun Li, Jianhua Xu
DOI: https://doi.org/10.1007/978-981-99-8391-9_11
2024-01-01
Abstract: As an active research topic in computer vision, scene text recognition (STR) aims to recognize character sequences in natural scenes. Currently, mainstream STR approaches consist of two main modules: a visual model for feature extraction and a sequence model for text transcription. The two modules operate separately and sequentially, which increases the complexity of the STR model. In this study, we propose a novel Simple Single Stage Sequencer for Scene Text Recognition (S5TR), which transforms text instance images directly into string sequences. Specifically, S5TR consists of stacked Sequencer blocks built from horizontal and vertical Long Short-Term Memory (LSTM) networks. On the one hand, S5TR extracts visual representations of images by modeling long-range dependencies with LSTMs, which play a role similar to self-attention in the Vision Transformer (ViT). On the other hand, the LSTM also serves as a sequence modeling module that captures contextual information within the character sequence to predict each character. Experimental results demonstrate that S5TR achieves highly competitive performance compared to existing STR methods.
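The horizontal and vertical LSTM mixing described in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the abstract gives no layer details, so the bidirectional scans, the concatenation of the two outputs, the gate layout, the random untrained weights, and every name (`LSTM`, `BiLSTM`, `sequencer_mix`) are assumptions chosen for illustration. A real model would use a deep-learning framework and would typically add a channel projection, normalization, and residual connections, all of which are omitted here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTM:
    """Tiny single-layer LSTM over a list of feature vectors (illustrative only)."""

    def __init__(self, in_dim, hid_dim, rng):
        self.hid_dim = hid_dim
        # 4 gates (input, forget, cell candidate, output), hid_dim rows each.
        # Every row acts on [x, h, 1]; the trailing 1 carries the bias.
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim + hid_dim + 1)]
                  for _ in range(4 * hid_dim)]

    def run(self, seq):
        d = self.hid_dim
        h, c = [0.0] * d, [0.0] * d
        outputs = []
        for x in seq:
            z = x + h + [1.0]
            pre = [sum(wi * zi for wi, zi in zip(row, z)) for row in self.w]
            i = [sigmoid(v) for v in pre[:d]]
            f = [sigmoid(v) for v in pre[d:2 * d]]
            g = [math.tanh(v) for v in pre[2 * d:3 * d]]
            o = [sigmoid(v) for v in pre[3 * d:]]
            c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
            h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
            outputs.append(h)
        return outputs


class BiLSTM:
    """Runs one LSTM forward and one backward, then concatenates per step."""

    def __init__(self, in_dim, hid_dim, rng):
        self.fwd = LSTM(in_dim, hid_dim, rng)
        self.bwd = LSTM(in_dim, hid_dim, rng)

    def run(self, seq):
        forward = self.fwd.run(seq)
        backward = self.bwd.run(seq[::-1])[::-1]
        return [a + b for a, b in zip(forward, backward)]


def sequencer_mix(fmap, horizontal, vertical):
    """Mix an H x W grid of C-dim vectors along rows and along columns.

    Each position ends up with the horizontal and vertical BiLSTM outputs
    concatenated, so it has seen context from its whole row and column.
    """
    height, width = len(fmap), len(fmap[0])
    rows = [horizontal.run(row) for row in fmap]               # rows[y][x]
    cols = [vertical.run([fmap[y][x] for y in range(height)])  # cols[x][y]
            for x in range(width)]
    return [[rows[y][x] + cols[x][y] for x in range(width)]
            for y in range(height)]


# Hypothetical sizes: a 4 x 8 feature map with 3 channels, hidden size 2.
rng = random.Random(0)
C, D = 3, 2
fmap = [[[rng.random() for _ in range(C)] for _ in range(8)] for _ in range(4)]
mixed = sequencer_mix(fmap, BiLSTM(C, D, rng), BiLSTM(C, D, rng))
```

Each output position carries `4 * D` values: `2 * D` from the row-wise scan and `2 * D` from the column-wise scan. Because the horizontal scan runs along the reading direction of the text, the same recurrent operation that mixes visual features can also carry character-level context, which is the single-stage idea the abstract describes.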