Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models

Uday Allu, Biddwan Ahmed, Vishesh Tripathi
2024-02-10
Abstract: The conventional use of the Retrieval-Augmented Generation (RAG) architecture has proven effective for retrieving information from diverse documents. However, challenges arise in handling complex table queries, especially within PDF documents containing intricate tabular structures. This research introduces an innovative approach to enhance the accuracy of complex table queries in RAG-based systems. Our methodology involves storing PDFs in the retrieval database and extracting tabular content separately. The extracted tables undergo a process of context enrichment, concatenating headers with corresponding values. To ensure a comprehensive understanding of the enriched data, we employ a fine-tuned version of the Llama-2-chat language model for summarisation within the RAG architecture. Furthermore, we augment the tabular data with contextual sense using the ChatGPT 3.5 API through a one-shot prompt. This enriched data is then fed into the retrieval database alongside other PDFs. Our approach aims to significantly improve the precision of complex table queries, offering a promising solution to a longstanding challenge in information retrieval.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The paper attempts to address the insufficient accuracy of existing Retrieval-Augmented Generation (RAG) architectures when handling complex table queries in information retrieval. Specifically, while traditional RAG architectures excel at retrieving information from diverse documents, they face challenges when dealing with complex table structures in PDF documents. These issues include:

1. **Complexity of Table Data**: The table structures in PDF documents are often very complex, containing multiple layers of nesting and various formats, making it difficult for traditional text extraction methods to parse these tables accurately.
2. **Lack of Context Understanding**: Traditional RAG architectures rely primarily on text retrieval and lack contextual understanding of table contents, leading to lower accuracy when handling complex table queries.

To address these issues, the paper proposes an approach that enhances the RAG architecture's ability to handle complex table queries through the following steps:

1. **PDF Storage and Table Extraction**: Store PDF documents in the retrieval database and extract table contents separately.
2. **Context Enrichment**: Enrich the extracted table contents by concatenating headers with corresponding values to retain the contextual information within the tables.
3. **Integration of Language Models**: Use a fine-tuned Llama-2-chat language model to summarise the table contents, and further enhance context understanding through the ChatGPT 3.5 API.
4. **Data Storage and Retrieval**: Store the enriched table data along with the original PDFs in the retrieval database to improve the accuracy of complex table queries.

Through this series of steps, the paper aims to significantly improve the accuracy of RAG architectures in handling complex table queries, thereby addressing a long-standing challenge in the field of information retrieval.
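
The context-enrichment step (concatenating headers with their corresponding values) can be illustrated with a minimal Python sketch. The function name `enrich_rows`, the `header: value` serialisation, and the sample table are assumptions made for illustration only; the paper does not specify its exact output format, and this is not the authors' implementation.

```python
from typing import List, Sequence


def enrich_rows(headers: Sequence[str], rows: Sequence[Sequence[str]]) -> List[str]:
    """Pair every cell with its column header so each row is readable on its own.

    Hypothetical helper: the "header: value; ..." format is an assumption,
    not the serialisation used in the paper.
    """
    enriched = []
    for row in rows:
        if len(row) != len(headers):
            # A mismatch usually means the table was extracted incorrectly.
            raise ValueError("row length does not match header length")
        pairs = [f"{h.strip()}: {str(v).strip()}" for h, v in zip(headers, row)]
        enriched.append("; ".join(pairs))
    return enriched


if __name__ == "__main__":
    # Made-up sample table, used only to show the shape of the output.
    headers = ["Region", "Quarter", "Revenue (USD m)"]
    rows = [["EMEA", "Q1", "12.4"], ["APAC", "Q1", "9.8"]]
    for line in enrich_rows(headers, rows):
        print(line)
```

Each enriched row carries its own column context, so a retriever that indexes rows as separate text chunks no longer depends on the header line being present in the same chunk.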