MinerU: An Open-Source Solution for Precise Document Content Extraction

Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, Bo Zhang, Liqun Wei, Zhihao Sui, Wei Li, Botian Shi, Yu Qiao, Dahua Lin, Conghui He
2024-09-27
Abstract: Document content analysis has been a crucial research area in computer vision. Despite significant advancements in methods such as OCR, layout detection, and formula recognition, existing open-source solutions struggle to consistently deliver high-quality content extraction due to the diversity in document types and content. To address these challenges, we present MinerU, an open-source solution for high-precision document content extraction. MinerU leverages the sophisticated PDF-Extract-Kit models to extract content from diverse documents effectively and employs finely-tuned preprocessing and postprocessing rules to ensure the accuracy of the final results. Experimental results demonstrate that MinerU consistently achieves high performance across various document types, significantly enhancing the quality and consistency of content extraction. The MinerU open-source project is available at https://github.com/opendatalab/MinerU.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty that existing open-source solutions have in consistently delivering high-quality content extraction across diverse document types. Although methods such as OCR, layout detection, and formula recognition have made significant progress, the diversity of document types and the complexity of their content still limit the accuracy and consistency of existing tools. To address this, the authors propose MinerU, an open-source solution for high-precision document content extraction. The main contributions of MinerU are:

1. **Adaptation to multiple document layouts**: Supports a wide range of document types, such as academic papers, textbooks, exam papers, and research reports.
2. **Content filtering**: Filters out irrelevant areas (such as headers, footers, footnotes, and side notes), improving the readability of the extracted document.
3. **Accurate segmentation**: Combines model-based and rule-based post-processing to merge paragraphs across columns and across pages.
4. **Powerful page element recognition**: Accurately distinguishes formulas, tables, images, text blocks, and their respective titles.

With these improvements, MinerU achieves high-quality and consistent content extraction on various document types, significantly improving the quality and efficiency of document parsing.

### Specific problem summary

- **OCR-based Text Extraction**: Extracts text directly from documents, but introduces a large amount of noise for documents containing images, tables, and formulas.
- **Library-based Text Parsing**: For non-scanned documents, reads content directly with Python libraries, but cannot handle documents containing complex elements such as formulas and tables.
- **Multi-Module Document Parsing**: Adopts a multi-stage processing method. Although it can theoretically produce high-quality results, existing open-source models mainly focus on academic papers and perform poorly on other types of documents.
- **End-to-End MLLM Document Parsing**: Uses multimodal large language models for document parsing, but still faces challenges in terms of data diversity and inference cost.

### MinerU's solution

MinerU adopts a multi-module document parsing strategy and uses multiple models from PDF-Extract-Kit to process different types of document content. Its workflow has four stages:

1. **Document pre-processing**: Read PDF files, filter out unprocessable files, and extract metadata.
2. **Document content parsing**: Use PDF-Extract-Kit to parse key content, including layout analysis, formula detection, and table recognition.
3. **Document content post-processing**: Remove invalid areas, splice content according to region location information, and ensure the accuracy of the final result.
4. **Format conversion**: Convert the processed result into the format required by the user (such as Markdown or JSON).

Through this process, MinerU extracts content from diverse documents efficiently and with high quality, addressing the shortcomings of existing tools on complex documents.
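
The four-stage workflow described above can be sketched in Python as a chain of small functions. This is only an illustration under stated assumptions, not MinerU's implementation: every name (`Region`, `preprocess`, `parse`, `postprocess`, `convert`, `run_pipeline`) is hypothetical, the parsing stage is stubbed with pre-detected regions instead of PDF-Extract-Kit models, and the ordering rule (sort by page, then top, then left) assumes a simple single-column layout.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass

# Hypothetical names throughout; none of these are MinerU or PDF-Extract-Kit APIs.


@dataclass
class Region:
    kind: str      # e.g. "title", "text", "formula", "header", "footer"
    page: int      # zero-based page index
    bbox: tuple    # (x0, y0, x1, y1), origin assumed at the top-left corner
    content: str


# Region kinds treated as irrelevant and dropped during post-processing.
DISCARDED_KINDS = {"header", "footer", "footnote", "side_note"}


def preprocess(doc: dict) -> dict:
    """Stage 1: reject unprocessable input and extract basic metadata."""
    if doc.get("encrypted") or not doc.get("pages"):
        raise ValueError("unprocessable document")
    return {"page_count": len(doc["pages"])}


def parse(doc: dict) -> list[Region]:
    """Stage 2: stand-in for model-based parsing.

    A real system would run layout, formula, and table models here;
    this toy version reads regions that are already present in the input.
    """
    return [
        Region(page=page_index, **region)
        for page_index, page in enumerate(doc["pages"])
        for region in page
    ]


def postprocess(regions: list[Region]) -> list[Region]:
    """Stage 3: remove invalid areas, then order by page, top edge, left edge."""
    kept = [r for r in regions if r.kind not in DISCARDED_KINDS]
    return sorted(kept, key=lambda r: (r.page, r.bbox[1], r.bbox[0]))


def convert(regions: list[Region], fmt: str = "markdown") -> str:
    """Stage 4: serialize the ordered regions as Markdown or JSON."""
    if fmt == "json":
        return json.dumps([asdict(r) for r in regions])
    blocks = []
    for r in regions:
        if r.kind == "title":
            blocks.append(f"# {r.content}")
        elif r.kind == "formula":
            blocks.append(f"$$\n{r.content}\n$$")
        else:
            blocks.append(r.content)
    return "\n\n".join(blocks)


def run_pipeline(doc: dict, fmt: str = "markdown") -> str:
    """Chain the four stages on one document."""
    preprocess(doc)
    return convert(postprocess(parse(doc)), fmt)
```

Calling `run_pipeline` on a one-page toy document that contains a header, a title, and a text region drops the header and emits the title before the body text, because the title's top edge sits higher on the page. Cross-column and cross-page paragraph merging, which the summary attributes to MinerU, is deliberately left out of this sketch.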