LOCR: Location-Guided Transformer for Optical Character Recognition

Yu Sun, Dongzhan Zhou, Chen Lin, Conghui He, Wanli Ouyang, Han-Sen Zhong
2024-03-04
Abstract: Academic documents are packed with texts, equations, tables, and figures, requiring comprehensive understanding for accurate Optical Character Recognition (OCR). While end-to-end OCR methods offer improved accuracy over layout-based approaches, they often grapple with significant repetition issues, especially with complex layouts in Out-Of-Domain (OOD)
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
What problem does this paper attempt to address?
The paper aims to address the accuracy issues of Optical Character Recognition (OCR) in academic documents, especially the repetition problem when dealing with complex layouts. Specifically:

1. **Repetition Problem in Complex Layouts**: Existing end-to-end OCR methods, while improving accuracy, tend to have repetition issues in documents with complex layouts, particularly in out-of-domain (OOD) documents.
2. **Position Awareness**: The paper points out that positional information is crucial for text decoding. In complex layouts, it is challenging for the model to accurately capture the positional information of all content.
3. **Interactive OCR Mode**: To further enhance the robustness and accuracy of the model, the paper introduces an interactive OCR mode that allows users to assist the model in generating complex documents through simple positional hints.

In summary, the paper proposes a Location-Guided Transformer model (LOCR) that aims to solve the repetition problem encountered by existing methods in complex layout documents through the integration of positional information. The model's performance is further improved through large-scale datasets and an interactive mode.