Developing Retrieval Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report

Ayman Asad Khan, Md Toufique Hasan, Kai Kristian Kemell, Jussi Rasku, Pekka Abrahamsson

2024-10-21

Abstract: This paper presents an experience report on the development of Retrieval Augmented Generation (RAG) systems using PDF documents as the primary data source. The RAG architecture combines generative capabilities of Large Language Models (LLMs) with the precision of information retrieval. This approach has the potential to redefine how we interact with and augment both structured and unstructured knowledge in generative models to enhance transparency, accuracy, and contextuality of responses. The paper details the end-to-end pipeline, from data collection and preprocessing to retrieval indexing and response generation, highlighting technical challenges and practical solutions. We aim to offer insights to researchers and practitioners developing similar systems using two distinct approaches: OpenAI's Assistant API with GPT Series and Llama's open-source models. The practical implications of this research lie in enhancing the reliability of generative AI systems in various sectors where domain-specific knowledge and real-time information retrieval are important. The Python code used in this work is also available at: https://github.com/GPT-Laboratory/RAG-LLM-Development-Guidebook-from-PDFs

Software Engineering, Artificial Intelligence, Information Retrieval

What problem does this paper attempt to address?

The paper addresses the limitations of conventional large language models (LLMs) in handling dynamic information. Because these models rely on static training data, their responses may be outdated or incomplete, especially in knowledge-intensive tasks that require real-time information retrieval. They also often lack transparency and accuracy, which is particularly important in high-stakes decision-making. To tackle these issues, the paper describes a development approach based on Retrieval Augmented Generation (RAG). A RAG system retrieves information from external data sources (such as PDF documents, databases, or websites) and combines it with the generative capabilities of an LLM to produce responses that are both contextually relevant and factually accurate. The approach is aimed particularly at industries that require domain-specific knowledge and real-time information retrieval. The paper details the end-to-end process, from data collection and preprocessing to retrieval indexing and response generation, and discusses the technical challenges and practical solutions. It is intended to help researchers and practitioners optimize RAG models to meet the needs of specific use cases.
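The retrieve-then-generate flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses OpenAI's Assistant API and Llama models, whereas this example assumes a toy bag-of-words retriever in place of an embedding index and stops at prompt assembly instead of calling an LLM. The function names (`retrieve`, `build_prompt`) and the sample chunks are hypothetical and stand in for text already extracted from PDFs.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, chunks, top_k=2):
    """Return up to top_k chunks ranked by similarity to the query.

    Assumption: word-count vectors replace the embedding-based index
    a real RAG system would normally use.
    """
    query_vec = Counter(tokenize(query))
    scored = [
        (cosine_similarity(query_vec, Counter(tokenize(chunk))), chunk)
        for chunk in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop chunks that share no words with the query.
    return [chunk for score, chunk in scored[:top_k] if score > 0]


def build_prompt(query, retrieved):
    """Combine the retrieved context and the question into one prompt."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )


# Hypothetical chunks standing in for text extracted from PDF documents.
chunks = [
    "The warranty period for the device is 24 months from purchase.",
    "Reset the device by holding the power button for ten seconds.",
    "The office cafeteria serves lunch between 11 and 13.",
]
query = "How long is the warranty period?"
retrieved = retrieve(query, chunks, top_k=1)
prompt = build_prompt(query, retrieved)
print(prompt)
```

In a full system the assembled prompt would be sent to the generative model, so the answer is grounded in the retrieved passage and not only in static training data.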

Revolutionizing Retrieval-Augmented Generation with Enhanced PDF Structure Recognition

Information retrieval from textual data: Harnessing large language models, retrieval augmented generation and prompt engineering

RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation

RAG based Chatbot using LLMs

An advanced retrieval-augmented generation system for manufacturing quality control

A Comprehensive Survey of Retrieval-Augmented Generation (RAG): Evolution, Current Landscape and Future Directions

Deploying Large Language Models With Retrieval Augmented Generation

Retrieval-Augmented Generation for Large Language Models: A Survey

Meta Knowledge for Retrieval Augmented Large Language Models

Development and Testing of Retrieval Augmented Generation in Large Language Models -- A Case Study Report

Towards Understanding Retrieval Accuracy and Prompt Quality in RAG Systems

RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation

AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline

Retrieval-Augmented Generation for Natural Language Processing: A Survey

The Chronicles of RAG: The Retriever, the Chunk and the Generator

Optimizing and Evaluating Enterprise Retrieval-Augmented Generation (RAG): A Content Design Perspective

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems

Retrieval augmented generation for building datasets from scientific literature

A Survey on Retrieval-Augmented Text Generation for Large Language Models