Revolutionizing Retrieval-Augmented Generation with Enhanced PDF Structure Recognition

Demiao Lin

2024-01-23

Abstract: With the rapid development of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become a predominant method in the field of professional knowledge-based question answering. Presently, major foundation model companies have opened up Embedding and Chat API interfaces, and frameworks like LangChain have already integrated the RAG process. It appears that the key models and steps in RAG have been resolved, leading to the question: are professional knowledge QA systems now approaching perfection? This article finds that current mainstream methods rest on the premise of access to high-quality text corpora. However, since professional documents are mainly stored as PDFs, the low accuracy of PDF parsing significantly reduces the effectiveness of professional knowledge-based QA. We conducted an empirical RAG experiment across hundreds of questions drawn from real-world professional documents. The results show that ChatDOC, a RAG system equipped with a panoptic and pinpoint PDF parser, retrieves more accurate and complete segments and thus gives better answers. ChatDOC is superior to the baseline on nearly 47% of questions, ties on 38%, and falls short on only 15%. This suggests that we may revolutionize RAG with enhanced PDF structure recognition.

Artificial Intelligence

What problem does this paper attempt to address?

The paper aims to address the limitations in the retrieval-augmented generation (RAG) systems, particularly focusing on improving the effectiveness of RAG by enhancing the parsing and chunking of PDF documents. The key problems targeted include:

1. **Low Accuracy in PDF Parsing**: Current RAG systems heavily rely on high-quality text corpora, but professional documents are often stored in PDF format, which poses significant challenges due to the low accuracy of PDF parsing.
2. **Impact on Professional Knowledge-based Question Answering (QA)**: The low accuracy of PDF parsing affects the retrieval of pertinent information, which is crucial for effective professional knowledge-based QA.

The authors conduct empirical experiments across hundreds of questions from real-world professional documents to evaluate the impact of enhanced PDF structure recognition on RAG systems. They develop ChatDOC, a RAG system equipped with advanced PDF parsing capabilities, and compare it against a baseline system using a rule-based parser (PyPDF).

The main contributions of the paper are:

- **Development of ChatDOC**: A RAG system that utilizes a deep learning-based PDF parser to improve the accuracy and completeness of retrieved segments, leading to better answers.
- **Empirical Evaluation**: Conducting experiments that demonstrate ChatDOC's superiority over the baseline system, showing improved performance in nearly 47% of questions, tying in 38%, and falling short in only 15%.

The paper fo
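To make the parsing-and-chunking point concrete, here is a minimal sketch of why structure matters at retrieval time. It is not ChatDOC's parser or the paper's baseline code. The `Block` type, its `kind` labels, and both chunking functions are hypothetical names introduced only for illustration, and the input is assumed to be a list of already-parsed layout elements. The sketch contrasts cutting flattened text every N characters with grouping whole elements under their nearest title.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One parsed layout element (hypothetical type for this sketch)."""
    kind: str  # assumed labels: "title", "paragraph", "table"
    text: str


def fixed_size_chunks(blocks, size):
    """Flatten all text and cut every `size` characters, ignoring structure."""
    flat = "\n".join(b.text for b in blocks)
    return [flat[i:i + size] for i in range(0, len(flat), size)]


def structure_aware_chunks(blocks, max_chars):
    """Group whole blocks under their nearest title; never split a block."""
    chunks, current, heading = [], [], ""

    def flush():
        # Emit the accumulated blocks as one chunk, prefixed by the heading.
        if current:
            body = "\n".join(current)
            chunks.append(f"{heading}\n{body}" if heading else body)
            current.clear()

    for b in blocks:
        if b.kind == "title":
            flush()  # a new section starts a new chunk
            heading = b.text
            continue
        used = sum(len(t) + 1 for t in current)
        if current and used + len(b.text) > max_chars:
            flush()  # start a new chunk rather than cutting the block
        current.append(b.text)
    flush()
    return chunks


# Toy document: two sections, one of which contains a small table.
table_text = "Year | Revenue\n2022 | 10\n2023 | 12"
doc = [
    Block("title", "Revenue"),
    Block("paragraph", "Revenue grew in 2023."),
    Block("table", table_text),
    Block("title", "Risks"),
    Block("paragraph", "Supply costs may rise."),
]

fixed = fixed_size_chunks(doc, size=40)
structured = structure_aware_chunks(doc, max_chars=200)
```

With this toy input, the fixed-size cut splits the table across two chunks, while the structure-aware grouping keeps it whole and attached to its section title. That is the kind of segment completeness the abstract attributes to better PDF structure recognition.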
