Abstract: Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.

What problem does this paper attempt to address?

This paper addresses the difficulty that existing open-source solutions have in consistently delivering high-quality content extraction across diverse document types. Although methods such as OCR, layout detection, and formula recognition have made significant progress, the diversity of document types and the complexity of their content still limit the accuracy and consistency of existing tools. To solve these problems, the authors propose MinerU, an open-source solution for high-precision document content extraction. The main contributions of MinerU are:

1. **Adaptation to multiple document layouts**: It supports a wide range of document types, such as academic papers, textbooks, exam papers, and research reports.
2. **Content filtering**: It filters out irrelevant areas (such as headers, footers, footnotes, and side notes), improving the readability of the extracted documents.
3. **Accurate segmentation**: It combines model-based and rule-based post-processing to merge paragraphs across columns and across pages.
4. **Powerful page element recognition**: It accurately distinguishes formulas, tables, images, text blocks, and their respective titles.

With these improvements, MinerU achieves high-quality and consistent content extraction on various document types, significantly improving the quality and efficiency of document parsing.

### Specific problem summary

- **OCR-based Text Extraction**: Extracts text directly from documents, but introduces a large amount of noise for documents containing images, tables, and formulas.
- **Library-based Text Parsing**: For non-scanned documents, uses Python libraries to read the content directly, but cannot handle documents containing complex elements such as formulas and tables.
- **Multi-Module Document Parsing**: Adopts a multi-stage processing method. Although it can theoretically produce high-quality results, existing open-source models mainly focus on academic papers and perform poorly on other types of documents.
- **End-to-End MLLM Document Parsing**: Uses multi-modal large language models for document parsing, but still faces challenges in terms of data diversity and inference cost.

### MinerU's solution

MinerU mainly adopts a multi-module document parsing strategy and uses multiple models from PDF-Extract-Kit to process different types of document content. Its workflow has four stages:

1. **Document pre-processing**: Read PDF files, filter out unprocessable files, and extract metadata.
2. **Document content parsing**: Use PDF-Extract-Kit to parse key content, including layout analysis, formula detection, and table recognition.
3. **Document content post-processing**: Remove invalid areas, splice content according to the location information of each area, and ensure the accuracy of the final result.
4. **Format conversion**: Convert the processed result into the format required by the user (such as Markdown or JSON).

Through this process, MinerU extracts the content of diverse documents efficiently and with high quality, addressing the shortcomings of existing tools in dealing with complex documents.
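The four stages described above can be illustrated with a minimal Python sketch. This is not MinerU's code or API: the `Block` type, the label names, the dictionary-based input document, and the two-column ordering heuristic are all assumptions made for illustration, and the parsing stage is a stand-in for the model inference that PDF-Extract-Kit would perform.

```python
import json
from dataclasses import dataclass, asdict


# Hypothetical block type; field and label names are illustrative only.
@dataclass
class Block:
    page: int
    label: str     # e.g. "title", "text", "formula", "header", "footer"
    bbox: tuple    # (x0, y0, x1, y1), origin at the top-left of the page
    content: str


# Assumed labels for regions that are filtered out during post-processing.
DISCARD_LABELS = {"header", "footer", "footnote", "side_note"}


def preprocess(doc: dict) -> dict:
    """Stage 1: reject unprocessable input and extract simple metadata."""
    if doc.get("encrypted") or not doc.get("pages"):
        raise ValueError("unprocessable document")
    return {"page_count": len(doc["pages"])}


def parse(doc: dict) -> list:
    """Stage 2: stand-in for layout, formula, and table model inference."""
    return [
        Block(page=i, **raw)
        for i, page in enumerate(doc["pages"])
        for raw in page
    ]


def postprocess(blocks: list, page_width: float) -> list:
    """Stage 3: drop invalid regions, then order blocks by location.

    Simplifying assumption: a two-column page, where a block belongs to
    the left column if its left edge lies left of the page midline.
    """
    kept = [b for b in blocks if b.label not in DISCARD_LABELS]
    mid = page_width / 2
    return sorted(
        kept,
        key=lambda b: (b.page, 0 if b.bbox[0] < mid else 1, b.bbox[1]),
    )


def convert(blocks: list, fmt: str) -> str:
    """Stage 4: emit the requested output format."""
    if fmt == "json":
        return json.dumps([asdict(b) for b in blocks])
    parts = []
    for b in blocks:
        if b.label == "title":
            parts.append(f"# {b.content}")
        elif b.label == "formula":
            parts.append(f"$$\n{b.content}\n$$")
        else:
            parts.append(b.content)
    return "\n\n".join(parts)


def run_pipeline(doc: dict, page_width: float, fmt: str = "markdown") -> str:
    preprocess(doc)
    return convert(postprocess(parse(doc), page_width), fmt)
```

Keeping each stage as a separate function mirrors the multi-module design: the filtering and ordering rules in `postprocess` can be adjusted without touching the parsing stage, and a real system would replace the midline heuristic with the model-based and rule-based merging the summary mentions.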

MinerU: An Open-Source Solution for Precise Document Content Extraction

Exploiting Multi-Category Characteristics and Unified Framework to Extract Web Content

Extracting Web Content by Exploiting Multi-Category Characteristics

Matminer: an Open Source Toolkit for Materials Data Mining

OpenUE: an Open Toolkit of Universal Extraction from Text

BPMiner: mining developers' behavior patterns from screen-captured task videos.

Scientific Information Understanding Via Open Educational Resources (OER).

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

A Benchmark of PDF Information Extraction Tools using a Multi-Task and Multi-Domain Evaluation Framework for Academic Documents

Automatic Document Metadata Extraction Based on Deep Networks.

MolMiner: You only look once for chemical structure recognition

Chinese web page content extraction based on page content analysis

Optimising Human-Machine Collaboration for Efficient High-Precision Information Extraction from Text Documents

PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format

CMiner: Opinion Extraction and Summarization for Chinese Microblogs

An OpenCV-based Framework for Table Information Extraction

Effective and efficient Semantic Table Interpretation using TableMiner+

CUTIE: Learning to Understand Documents with Convolutional Universal Text Information Extractor

AMiner: Search and Mining of Academic Social Networks

AMiner-mini: A People Search Engine for University.

Automatic content based title extraction for Chinese documents using support vector machine