Abstract: In the fields of Optical Character Recognition (OCR) and Natural Language Processing (NLP), integrating multilingual capabilities remains a critical challenge, especially for languages with complex scripts such as Arabic. This paper introduces the Comprehensive Post-OCR Parsing and Receipt Understanding Dataset (CORU), a novel dataset specifically designed to enhance OCR and information extraction from receipts in multilingual contexts involving Arabic and English. CORU consists of over 20,000 annotated receipts from diverse retail settings, including supermarkets and clothing stores, alongside 30,000 annotated images for OCR, used to recognize each detected line, and 10,000 items annotated for detailed information extraction. These annotations capture essential details such as merchant names, item descriptions, total prices, receipt numbers, and dates. They are structured to support three primary computational tasks: object detection, OCR, and information extraction. We establish baseline performance for a range of models on CORU to evaluate the effectiveness of traditional methods, like Tesseract OCR, and more advanced neural network-based approaches. These baselines are crucial for processing the complex and noisy document layouts typical of real-world receipts and for advancing the state of automated multilingual document processing. Our datasets are publicly accessible at https://github.com/Update-For-Integrated-Business-AI/CORU.

KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension

JaQuAD: Japanese Question Answering Dataset for Machine Reading Comprehension

ArQuAD: An Expert-Annotated Arabic Machine Reading Comprehension Dataset

UQuAD1.0: Development of an Urdu Question Answering Training Data for Machine Reading Comprehension

KorMedMCQA: Multi-Choice Question Answering Benchmark for Korean Healthcare Professional Licensing Examinations

AI Student: A Machine Reading Comprehension System for the Korean College Scholastic Ability Test

A Vietnamese Dataset for Evaluating Machine Reading Comprehension

KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language

DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications

A Span-Extraction Dataset for Chinese Machine Reading Comprehension

KMMLU: Measuring Massive Multitask Language Understanding in Korean

NorQuAD: Norwegian Question Answering Dataset

ReCO: A Large Scale Chinese Reading Comprehension Dataset on Opinion

CORU: Comprehensive Post-OCR Parsing and Receipt Understanding Dataset

Multi-Paragraph Machine Reading Comprehension with Hybrid Reader over Tables and Text

BanglaQuAD: A Bengali Open-domain Question Answering Dataset

Native Chinese Reader: A Dataset Towards Native-Level Chinese Machine Reading Comprehension

Cross-Lingual Question Answering over Knowledge Base as Reading Comprehension

MA-MRC: A Multi-answer Machine Reading Comprehension Dataset

Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

KazQAD: Kazakh Open-Domain Question Answering Dataset