Abstract: Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, and slides. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The code and dataset are available at https://github.com/opendatalab/OmniDocBench.

What problem does this paper attempt to address?

The main problem this paper addresses is the limited diversity and incomplete evaluation of current document content extraction methods. Specifically:

1. **Limited document types**: Current evaluations focus mainly on a single document type, academic papers, while real-world application scenarios include textbooks, exam papers, financial reports, newspapers, magazines, and other document types.
2. **Narrow evaluation dimensions**: Module-based methods usually evaluate only specific algorithm modules, such as OCR, layout detection, or formula recognition, while judging the quality of the overall parsing result requires more comprehensive evaluation metrics.
3. **Inadequate evaluation metrics**: Although multimodal large-model methods attempt to evaluate document parsing quality along multiple dimensions, commonly used metrics (such as BLEU scores or edit distance) cannot accurately and fairly evaluate parsing quality when the output is in a markup language such as LaTeX or HTML.

To meet these challenges, the paper proposes a new multi-source benchmark named OmniDocBench, which aims to promote the development of automatic document content extraction technology. OmniDocBench has the following characteristics:

- **High-quality and diverse evaluation set**: Through automated annotation, manual verification, and expert review, the authors construct a comprehensive, detailed, and high-quality evaluation set containing nine different types of document pages.
- **Support for flexible and comprehensive evaluation dimensions**: The evaluation set covers 19 layout category labels and 14 attribute labels, supporting evaluation of the whole dataset, individual modules, or specific data types.
- **Comprehensive evaluation of mainstream methods**: Based on OmniDocBench, the paper evaluates current mainstream modular pipelines and end-to-end large-model methods, providing a fair comparison of existing methods and summarizing the shortcomings of current document parsing approaches, thereby guiding the further development of document parsing technology.

Through these contributions, OmniDocBench establishes a robust, diverse, and fair evaluation standard for the field of document content extraction, providing important insights for future progress.
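
The point about inadequate metrics can be illustrated with a small sketch. It is not OmniDocBench's evaluation code: the function names, the choice to normalize by the longer string, and the two sample LaTeX strings are all assumptions made for illustration. It computes a character-level edit distance between two LaTeX snippets that render as the same fraction, showing that a purely string-based metric still reports a nonzero error.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Hypothetical example: both snippets typeset the fraction a over b,
# yet their source strings differ.
reference = r"\frac{a}{b}"
prediction = r"{a \over b}"

score = normalized_edit_distance(reference, prediction)
print(f"normalized edit distance: {score:.2f}")
```

A prediction that is visually correct is penalized here only because it uses different markup, which is the kind of unfairness the summary attributes to string-based metrics on LaTeX or HTML output.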

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

OmniBench: Towards The Future of Universal Omni-Language Models

BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations

READoc: A Unified Benchmark for Realistic Document Structured Extraction

DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding

OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large Language Model and its Omni-Extensions

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

MinerU: An Open-Source Solution for Precise Document Content Extraction

A Comparative Study of PDF Parsing Tools Across Diverse Document Categories

SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding

OmniCity: Omnipotent City Understanding with Multi-level and Multi-view Images

OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents

PP-StructureV2: A Stronger Document Analysis System