OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He
2024-12-11
Abstract: Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.
Computer Vision and Pattern Recognition, Artificial Intelligence, Information Retrieval
What problem does this paper attempt to address?
The main problem this paper addresses is that current document content extraction methods are limited in both document diversity and comprehensiveness of evaluation. Specifically:

1. **Limited document types**: Current evaluations focus mainly on a single document type, academic papers, while real-world applications also involve textbooks, exam papers, financial reports, newspapers, magazines, and other kinds of documents.
2. **Narrow evaluation dimensions**: Module-based methods usually evaluate only specific algorithm modules, such as OCR, layout detection, or formula recognition, whereas judging the quality of the overall parsing result requires more comprehensive evaluation metrics.
3. **Inadequate evaluation metrics**: Although multimodal large-model methods attempt to evaluate document parsing quality along multiple dimensions, commonly used metrics (such as BLEU scores or edit distances) cannot accurately and fairly assess parsing quality when the output is a markup language such as LaTeX or HTML.

To meet these challenges, the paper proposes a new multi-source benchmark named OmniDocBench, which aims to advance automated document content extraction. OmniDocBench has the following characteristics:

- **High-quality and diverse evaluation set**: Through automated annotation, manual verification, and expert review, the authors constructed a comprehensive, detailed, and high-quality evaluation set containing nine different types of document pages.
- **Flexible and comprehensive evaluation dimensions**: The evaluation set covers 19 layout category labels and 14 attribute labels, supporting evaluation of the whole dataset, of individual modules, or of specific data types.
- **Comprehensive evaluation of mainstream methods**: Using OmniDocBench, the authors evaluate current mainstream modular pipelines and end-to-end large-model methods, providing a fair comparison of existing methods and summarizing the shortcomings of current document parsing approaches to guide further development.

Through these contributions, OmniDocBench establishes a robust, diverse, and fair evaluation standard for the field of document content extraction and offers important insights for future work.
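The point about inadequate metrics can be made concrete with a small sketch. The code below is an illustration only, not the benchmark's own metric implementation: the function names and the sample strings are assumptions chosen for this example. It computes a plain character-level edit distance and applies it to two LaTeX strings that render as the same fraction, showing that a string metric still reports a large difference.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


# Hypothetical sample: two LaTeX spellings of the same rendered fraction.
reference = r"\frac{a}{b}"
prediction = r"{a \over b}"

print(normalized_edit_distance(reference, reference))   # identical strings
print(normalized_edit_distance(reference, prediction))  # same rendering, nonzero distance
```

A parser that emits the second form would be penalized by this metric even though the rendered formula is correct, which is the kind of unfairness the summary attributes to raw string metrics on markup output.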