Revolutionizing Retrieval-Augmented Generation with Enhanced PDF Structure Recognition

Demiao Lin
2024-01-23
Abstract: With the rapid development of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become a predominant method in the field of professional knowledge-based question answering. Presently, major foundation model companies have opened up Embedding and Chat APIs, and frameworks like LangChain have already integrated the RAG process. It appears that the key models and steps in RAG have been resolved, leading to the question: are professional knowledge QA systems now approaching perfection? This article finds that current primary methods depend on the premise of accessing high-quality text corpora. However, since professional documents are mainly stored in PDFs, the low accuracy of PDF parsing significantly impacts the effectiveness of professional knowledge-based QA. We conducted an empirical RAG experiment across hundreds of questions from the corresponding real-world professional documents. The results show that ChatDOC, a RAG system equipped with a panoptic and pinpoint PDF parser, retrieves more accurate and complete segments and thus produces better answers. ChatDOC is superior to the baseline on nearly 47% of questions, ties on 38%, and falls short on only 15%. This suggests that we may revolutionize RAG with enhanced PDF structure recognition.
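To make the abstract's point about "more accurate and complete segments" concrete, here is a minimal sketch contrasting two chunking strategies. It is not ChatDOC's parser or the paper's baseline code: the `Block` type, both function names, and the sample document are hypothetical, and the sketch simply assumes that a structure-aware parser has already labelled each layout element.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One layout element, as a hypothetical structure-aware parser might emit it."""
    kind: str  # assumed labels: "heading", "paragraph", "table"
    text: str


def chunk_fixed(text: str, size: int) -> list[str]:
    """Layout-blind chunking: cut the flat text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chunk_by_structure(blocks: list[Block]) -> list[str]:
    """Start a new chunk at each heading and never split inside a block."""
    chunks: list[list[str]] = []
    for block in blocks:
        if block.kind == "heading" or not chunks:
            chunks.append([])
        chunks[-1].append(block.text)
    return ["\n".join(parts) for parts in chunks]


# Made-up sample document, for illustration only.
blocks = [
    Block("heading", "1 Revenue"),
    Block("paragraph", "Revenue grew in every region."),
    Block("table", "Region | 2022 | 2023\nEMEA | 10 | 12\nAPAC | 7 | 9"),
    Block("heading", "2 Costs"),
    Block("paragraph", "Costs were flat."),
]
flat_text = "\n".join(b.text for b in blocks)
table_text = blocks[2].text

fixed_chunks = chunk_fixed(flat_text, 40)
structured_chunks = chunk_by_structure(blocks)

# The table stays whole in a structure-aware chunk but is cut by fixed-size chunking.
print(any(table_text in c for c in structured_chunks))
print(any(table_text in c for c in fixed_chunks))
```

In this toy case the 48-character table cannot fit in any 40-character chunk, so a retriever working on the fixed-size chunks could only return part of it, whereas the structure-aware chunk keeps the table together with its section heading.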
Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to address the limitations in the retrieval-augmented generation (RAG) systems, particularly focusing on improving the effectiveness of RAG by enhancing the parsing and chunking of PDF documents. The key problems targeted include:

1. **Low Accuracy in PDF Parsing**: Current RAG systems heavily rely on high-quality text corpora, but professional documents are often stored in PDF format, which poses significant challenges due to the low accuracy of PDF parsing.
2. **Impact on Professional Knowledge-based Question Answering (QA)**: The low accuracy of PDF parsing affects the retrieval of pertinent information, which is crucial for effective professional knowledge-based QA.

The authors conduct empirical experiments across hundreds of questions from real-world professional documents to evaluate the impact of enhanced PDF structure recognition on RAG systems. They develop ChatDOC, a RAG system equipped with advanced PDF parsing capabilities, and compare it against a baseline system using a rule-based parser (PyPDF).

The main contributions of the paper are:

- **Development of ChatDOC**: A RAG system that utilizes a deep learning-based PDF parser to improve the accuracy and completeness of retrieved segments, leading to better answers.
- **Empirical Evaluation**: Conducting experiments that demonstrate ChatDOC's superiority over the baseline system, showing improved performance in nearly 47% of questions, tying in 38%, and falling short in only 15%.

The paper fo