Beyond Extraction: Contextualising Tabular Data for Efficient Summarisation by Language Models

Uday Allu, Biddwan Ahmed, Vishesh Tripathi

2024-02-10

Abstract: The conventional use of the Retrieval-Augmented Generation (RAG) architecture has proven effective for retrieving information from diverse documents. However, challenges arise in handling complex table queries, especially within PDF documents containing intricate tabular structures. This research introduces an innovative approach to enhance the accuracy of complex table queries in RAG-based systems. Our methodology involves storing PDFs in the retrieval database and extracting tabular content separately. The extracted tables undergo a process of context enrichment, concatenating headers with corresponding values. To ensure a comprehensive understanding of the enriched data, we employ a fine-tuned version of the Llama-2-chat language model for summarisation within the RAG architecture. Furthermore, we augment the tabular data with contextual sense using the ChatGPT 3.5 API through a one-shot prompt. This enriched data is then fed into the retrieval database alongside other PDFs. Our approach aims to significantly improve the precision of complex table queries, offering a promising solution to a longstanding challenge in information retrieval.

Machine Learning, Computation and Language

What problem does this paper attempt to address?

The paper attempts to address the issue of insufficient accuracy in existing Retrieval-Augmented Generation (RAG) architectures when handling complex table queries in information retrieval. Specifically, while traditional RAG architectures excel in retrieving information from diverse documents, they face challenges when dealing with complex table structures in PDF documents. These issues include:

1. **Complexity of Table Data**: The table structures in PDF documents are often very complex, containing multiple layers of nesting and various formats, making it difficult for traditional text extraction methods to accurately parse these tables.
2. **Lack of Context Understanding**: Traditional RAG architectures primarily rely on text retrieval and lack context understanding of table contents, leading to lower accuracy when handling complex table queries.

To address these issues, the paper proposes an innovative approach to enhance the RAG architecture's ability to handle complex table queries through the following steps:

1. **PDF Storage and Table Extraction**: Store PDF documents in the retrieval database and extract table contents separately.
2. **Context Enrichment**: Enrich the extracted table contents by concatenating headers with corresponding values to retain the contextual information within the tables.
3. **Integration of Language Models**: Use a fine-tuned Llama-2-chat language model to summarize the table contents and further enhance context understanding through the ChatGPT 3.5 API.
4. **Data Storage and Retrieval**: Store the enriched table data along with the original PDFs in the retrieval database to improve the accuracy of complex table queries.

Through this series of methods, the paper aims to significantly improve the accuracy of RAG architectures in handling complex table queries, thereby addressing a long-standing challenge in the field of information retrieval.
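The context-enrichment step, concatenating headers with corresponding values, can be illustrated with a minimal sketch. The paper summary does not specify the exact output format, so the `header: value` pairing, the `; ` separator, the function names `enrich_row` and `enrich_table`, and the sample table are all assumptions made here for illustration only.

```python
from typing import List


def enrich_row(headers: List[str], row: List[str]) -> str:
    """Pair each header with its cell value so the row reads as standalone text.

    The "header: value" format and "; " separator are illustrative assumptions,
    not the format used by the paper's authors.
    """
    if len(headers) != len(row):
        raise ValueError("row length must match header length")
    return "; ".join(f"{h}: {v}" for h, v in zip(headers, row))


def enrich_table(headers: List[str], rows: List[List[str]]) -> List[str]:
    """Apply enrich_row to every row of an already-extracted table."""
    return [enrich_row(headers, row) for row in rows]


# Hypothetical sample table, invented for this example.
headers = ["Region", "Quarter", "Revenue (USD)"]
rows = [
    ["North", "Q1", "1200"],
    ["South", "Q1", "950"],
]

enriched = enrich_table(headers, rows)
for line in enriched:
    print(line)
```

Each enriched string carries its own column context, so a row can still be interpreted once it is separated from the table header. In the pipeline described above, strings like these would then be passed on for summarisation and stored alongside the original PDFs.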

Reasoning-Aware Query-Focused Summarization over Multi-Table Data

Developing Retrieval Augmented Generation (RAG) based LLM Systems from PDFs: An Experience Report

Information retrieval from textual data: Harnessing large language models, retrieval augmented generation and prompt engineering

Revolutionizing Retrieval-Augmented Generation with Enhanced PDF Structure Recognition

Tabular Embedding Model (TEM): Finetuning Embedding Models For Tabular RAG Applications

TableRAG: Million-Token Table Understanding with Language Models

Retrieval Augmented Generation and Representative Vector Summarization for large unstructured textual data in Medical Education

QTSumm: Query-Focused Summarization over Tabular Data

Towards a Robust Retrieval-Based Summarization System

RAG based Chatbot using LLMs

UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis

Evaluation of Table Representations to Answer Questions from Tables in Documents : A Case Study using 3GPP Specifications

Abstractive and Extractive Text Summarization using Document Context Vector and Recurrent Neural Networks

Bridging the Gap: Deciphering Tabular Data Using Large Language Model

Contextual embedded text summarizer system: A hybrid approach

Context Tuning for Retrieval Augmented Generation

QFMTS: Generating Query-Focused Summaries over Multi-Table Inputs

Enhancing Retrieval Processes for Language Generation with Augmented Queries

Meta Knowledge for Retrieval Augmented Large Language Models

From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries