OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

Junyuan Zhang, Qintong Zhang, Bin Wang, Linke Ouyang, Zichen Wen, Ying Li, Ka-Ho Chow, Conghui He, Wentao Zhang
2024-12-04
Abstract: Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). However, given the imperfect prediction of OCR and the inherent non-uniform representation of structured data, knowledge bases inevitably contain various OCR noises. In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world RAG application domains, along with Q&As derived from multimodal elements in documents, challenging existing OCR solutions used for RAG. To better understand OCR's impact on RAG systems, we identify two primary types of OCR noise, Semantic Noise and Formatting Noise, and apply perturbation to generate a set of structured data with varying degrees of each OCR noise. Using OHRBench, we first conduct a comprehensive evaluation of current OCR solutions and reveal that none is competent for constructing high-quality knowledge bases for RAG systems. We then systematically evaluate the impact of these two noise types and demonstrate the vulnerability of RAG systems. Furthermore, we discuss the potential of employing Vision-Language Models (VLMs) without OCR in RAG systems. Code: https://github.com/opendatalab/OHR-Bench
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper evaluates the cascading impact of Optical Character Recognition (OCR) on Retrieval-Augmented Generation (RAG) systems. RAG knowledge bases are commonly built by using OCR to extract structured data from unstructured PDF documents. Because OCR predictions are imperfect and structured data can be represented in non-uniform ways, the resulting knowledge base inevitably contains OCR noise, which degrades RAG performance.

To study this problem, the paper introduces OHRBench, a benchmark designed to evaluate the impact of OCR on RAG systems. OHRBench includes 350 carefully selected unstructured PDF documents from six real-world application domains, along with question-answer pairs derived from the multimodal elements of these documents. The documents cover textbooks, law, finance, newspapers, manuals, and academia. Their complex layouts and multimodal elements, such as tables and formulas, make it challenging to build a high-quality knowledge base.

### Main Contributions

1. **OHRBench Benchmark**:
   - It contains a variety of unstructured PDF documents from six RAG application domains.
   - It provides question-answer pairs derived from multimodal document elements, which challenge the OCR solutions currently used in RAG systems.
2. **OCR Noise Types**:
   - Two main types of OCR noise are identified: Semantic Noise and Formatting Noise.
   - By systematically perturbing real data, sets of structured data with varying degrees of each noise type are generated to support the exploration of RAG-tailored OCR solutions.
3. **Comprehensive Evaluation**:
   - A comprehensive evaluation of current OCR solutions shows that even the best one leaves a performance gap of at least 7.5%.
   - The fine-grained and cascading impacts of the two types of OCR noise on RAG performance are systematically analyzed, providing insights for developing RAG-tailored OCR solutions.

### Experimental Results

- **Evaluation of OCR Solutions**:
  - Pipeline-based OCR performs best, especially on documents with complex layouts.
  - End-to-end OCR and Vision-Language Models (VLMs) perform well in some domains, but their performance drops significantly on complex layouts and unseen data distributions.
  - Every OCR solution loses performance at different stages, and even the best solution shows a clear drop in the overall evaluation.
- **Impact of OCR Noise**:
  - Semantic noise significantly affects both the retrieval and generation stages. As the degree of perturbation increases, the performance of most retrievers and LLMs drops by nearly 50%.
  - Formatting noise affects retrievers and LLMs unevenly: the size of its effect depends on the specific component.

### Conclusion

Current OCR solutions struggle to remain robust and effective across diverse real-world RAG scenarios. In addition, the OCR edit-distance metric does not always correlate with RAG performance. Further research on OCR solutions tailored to RAG systems is therefore needed to reduce the impact of OCR noise on RAG performance.
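
The point about edit distance can be made concrete with a small sketch. The code below is an illustration only, not the OHRBench implementation: the function names, the character-confusion table, and the toy table row are all assumptions made up for this example. It computes a normalized Levenshtein distance and applies a simple, hypothetical form of semantic noise at a chosen rate, loosely mirroring the idea of perturbing text to varying degrees.

```python
import random


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, truth: str) -> float:
    """Edit distance divided by the longer length, so the result lies in [0, 1]."""
    longest = max(len(pred), len(truth))
    return edit_distance(pred, truth) / longest if longest else 0.0


def add_semantic_noise(text: str, rate: float, seed: int = 0) -> str:
    """Hypothetical perturbation: swap confusable characters with probability `rate`."""
    # Assumed look-alike table for illustration; not taken from the paper.
    confusions = {"0": "O", "1": "l", "5": "S", "8": "B"}
    rng = random.Random(seed)
    return "".join(
        confusions[c] if c in confusions and rng.random() < rate else c
        for c in text
    )


# Toy example (made-up data): one table row extracted three ways.
truth = "| Revenue | 42 |"
semantic = "| Revenue | 47 |"                        # one misread digit changes the answer
formatting = "<tr><td>Revenue</td><td>42</td></tr>"  # same content, different markup

if __name__ == "__main__":
    print("semantic  :", normalized_edit_distance(semantic, truth))
    print("formatting:", normalized_edit_distance(formatting, truth))
    print("perturbed :", add_semantic_noise("Invoice 2015, total 1080", rate=0.5))
```

In this toy case the markup-only variant sits much further from the reference than the variant with a wrong digit, even though only the latter changes what a RAG system would answer. That is one intuition for why a lower edit distance need not mean better downstream performance.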