CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset

Abdelrahman Abdallah, Mahmoud Abdalla, Mahmoud SalahEldin Kasem, Mohamed Mahmoud, Ibrahim Abdelhalim, Mohamed Elkasaby, Yasser ElBendary, Adam Jatowt

2024-06-07

Abstract: In the fields of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially when considering languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and information extraction from receipts in multilingual contexts involving Arabic and English. CORU consists of over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores, alongside 30,000 annotated images for OCR that were utilized to recognize each detected line, and 10,000 items annotated for detailed information extraction. These annotations capture essential details such as merchant names, item descriptions, total prices, receipt numbers, and dates. They are structured to support three primary computational tasks: object detection, OCR, and information extraction. We establish the baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods, like Tesseract OCR, and more advanced neural network-based approaches. These baselines are crucial for processing the complex and noisy document layouts typical of real-world receipts and for advancing the state of automated multilingual document processing. Our datasets are publicly accessible (https://github.com/Update-For-Integrated-Business-AI/CORU).

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

The paper addresses the challenges of Optical Character Recognition (OCR) and information extraction in multilingual settings, particularly receipts that mix English with Arabic, a language with a complex script. It introduces a new dataset, CORU (Comprehensive Post-OCR Parsing and Receipt Understanding), designed to improve OCR and information extraction from such receipts. CORU contains over 20,000 annotated receipts from retail environments such as supermarkets and clothing stores, 30,000 annotated images for line-level OCR, and 10,000 items annotated for detailed information extraction. The annotations capture key details such as merchant names, item descriptions, total prices, receipt numbers, and dates, and they support three tasks: object detection, OCR, and information extraction. The paper also establishes baseline performance for a range of models on CORU, from traditional methods such as Tesseract OCR to more advanced neural network-based approaches. These baselines matter because real-world receipts have complex, noisy layouts, and they give later work on automated multilingual document processing a reference point. The dataset is publicly available for research and development of OCR and information extraction systems.
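As a rough illustration of the information extraction target described above, the fields named in the abstract can be held in a small record type. This is a minimal sketch under stated assumptions: the class name `ReceiptRecord`, the helper `missing_fields`, the field names, and the sample values are all hypothetical and are not taken from the CORU release, whose actual annotation format may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReceiptRecord:
    """Hypothetical container for the key fields the annotations are said to capture."""
    merchant_name: Optional[str] = None
    item_descriptions: List[str] = field(default_factory=list)
    total_price: Optional[str] = None      # kept as text to avoid assuming a number format
    receipt_number: Optional[str] = None
    date: Optional[str] = None             # kept as text to avoid assuming a date format


def missing_fields(record: ReceiptRecord) -> List[str]:
    """Return the names of fields that an extraction step left empty."""
    missing = []
    for name, value in vars(record).items():
        if value is None or value == []:
            missing.append(name)
    return missing


# Made-up example values mixing Arabic and English item text.
record = ReceiptRecord(
    merchant_name="Example Market",
    item_descriptions=["خبز", "Milk 1L"],
    total_price="42.50",
)
print(missing_fields(record))  # ['receipt_number', 'date']
```

A check like `missing_fields` is one simple way to see which fields an extraction model failed to fill for a given receipt; it says nothing about whether the filled values are correct.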

AMuRD: Annotated Arabic-English Receipt Dataset for Key Information Extraction and Classification

UTRNet: High-Resolution Urdu Text Recognition In Printed Documents

ArQuAD: An Expert-Annotated Arabic Machine Reading Comprehension Dataset

ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction

ADOCRNet: A Deep Learning OCR for Arabic Documents Recognition

PEaCE: A Chemistry-Oriented Dataset for Optical Character Recognition on Scientific Documents

KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension

Extending TrOCR for Text Localization-Free OCR of Full-Page Scanned Receipt Images

DavarOCR: A Toolbox for OCR and Multi-Modal Document Understanding

Advancements and Challenges in Arabic Optical Character Recognition: A Comprehensive Survey

An Efficient Language-Independent Multi-Font OCR for Arabic Script

bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents

ArabicaQA: A Comprehensive Dataset for Arabic Question Answering

MMOCR: A Comprehensive Toolbox for Text Detection, Recognition and Understanding

An End-to-End OCR Framework for Robust Arabic-Handwriting Recognition using a Novel Transformers-based Model and an Innovative 270 Million-Words Multi-Font Corpus of Classical Arabic with Diacritics

Making Old Kurdish Publications Processable by Augmenting Available Optical Character Recognition Engines

A Novel Approach to Printed Arabic Optical Character Recognition

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

A Survey of OCR in Arabic Language: Applications, Techniques, and Challenges

RPC: A Large-Scale Retail Product Checkout Dataset