TableRAG: Million-Token Table Understanding with Language Models

Si-An Chen, Lesly Miculicich, Julian Martin Eisenschlos, Zifeng Wang, Zilong Wang, Yanfei Chen, Yasuhisa Fujii, Hsuan-Tien Lin, Chen-Yu Lee, Tomas Pfister
2024-10-07
Abstract: Recent advancements in language models (LMs) have notably enhanced their ability to reason with tabular data, primarily through program-aided mechanisms that manipulate and analyze tables. However, these methods often require the entire table as input, leading to scalability challenges due to the positional bias or context length constraints. In response to these challenges, we introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. This enables more efficient data encoding and precise retrieval, significantly reducing prompt lengths and mitigating information loss. We have developed two new million-token benchmarks from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's effectiveness at scale. Our results demonstrate that TableRAG's retrieval design achieves the highest retrieval quality, leading to the new state-of-the-art performance on large-scale table understanding.
Computation and Language, Artificial Intelligence, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
This paper asks **how to overcome the limitations of existing language models (LMs) in context length, computational cost, and reasoning ability when processing large-scale tabular data, so that table understanding tasks can be handled efficiently and accurately**.

### Problem Background

In recent years, language models have made significant progress on table understanding tasks, mainly by manipulating and analyzing tables through program-aided mechanisms. However, these methods usually require the entire table as input, which leads to the following challenges:

1. **Context Length Limitation**: A large table (for example, 100 columns by 200 rows) may exceed 40,000 tokens, more than the context length of popular language models such as the LLaMA and GPT series.
2. **Decline in Reasoning Ability**: A long context can degrade reasoning, the so-called "lost in the middle" phenomenon.
3. **Increased Computational Cost and Latency**: Computational cost and latency grow significantly with table size.

In addition, simply truncating the table or reading only its structure (schema) loses key information. Existing methods based on schema and row-column retrieval reduce the input length, but they still face computational and performance challenges on tables with extremely large amounts of data.

### Solution

To solve these problems, the paper proposes **TableRAG**, a table understanding method based on the Retrieval-Augmented Generation (RAG) framework. Its main features are:

1. **Query Expansion**: Generate multiple schema queries and cell queries to locate key information accurately.
2. **Schema Retrieval**: Identify key columns and their data types from column names alone, avoiding the need to encode entire columns.
3. **Cell Retrieval**: Encode each cell independently and retrieve key cell values by relevance, reducing encoding cost.
4. **Frequency-Aware Truncation**: Introduce an encoding budget B and restrict encoding to the B most frequently occurring column-value pairs, which keeps the cost of processing large tables bounded.

### Main Contributions

1. **First Extensive Study**: Explore the application of language models to large-scale real-world tables and analyze the scalability and limitations of existing methods.
2. **New Benchmark Datasets**: Construct two new benchmark datasets from Arcade and BIRD-SQL, plus a synthetic dataset extended from TabFact, covering tables ranging from dozens to millions of cells.
3. **Efficient Framework**: Develop TableRAG, demonstrate its superior performance on large tables, and significantly reduce token consumption.

Through these innovations, TableRAG maintains efficiency and accuracy on large-scale tables, addressing the scalability and performance shortcomings of existing methods.
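
The schema retrieval, cell retrieval, and frequency-aware truncation steps described under Solution can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `embed` is a toy bag-of-words stand-in for a real text encoder, and the names `build_cell_corpus`, `retrieve`, `budget`, and `top_k`, along with the sample table and queries, are hypothetical. In a real pipeline the expanded queries would presumably be generated by the LM rather than written by hand.

```python
from collections import Counter
import math
import re


def embed(text):
    """Toy stand-in for a text encoder (assumption): bag-of-words token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_cell_corpus(table, budget):
    """Frequency-aware truncation: keep at most `budget` distinct
    (column, value) pairs, preferring the most frequent ones."""
    counts = Counter()
    for column, values in table.items():
        for value in values:
            counts[(column, str(value))] += 1
    return [pair for pair, _ in counts.most_common(budget)]


def retrieve(queries, candidates, to_text, top_k):
    """For each query, take the top_k most similar candidates;
    merge the results across queries without duplicates."""
    results = []
    for query in queries:
        q = embed(query)
        ranked = sorted(
            candidates, key=lambda c: cosine(q, embed(to_text(c))), reverse=True
        )
        for candidate in ranked[:top_k]:
            if candidate not in results:
                results.append(candidate)
    return results


# Hypothetical table stored column-wise, with hand-written expanded queries.
table = {
    "product": ["wallet", "wallet", "belt", "wallet", "scarf"],
    "price": [20, 20, 35, 20, 15],
}

# Schema retrieval: only column names are encoded, never whole columns.
columns = retrieve(["product name"], list(table), lambda c: c, top_k=1)

# Cell retrieval: only the `budget` most frequent distinct pairs are encoded.
cells = build_cell_corpus(table, budget=2)
hits = retrieve(["wallet"], cells, lambda p: f"{p[0]} {p[1]}", top_k=1)
```

With a budget in place, the number of encoded items depends on the number of columns and on `budget` rather than on the number of rows, which is the property that lets this style of retrieval scale to very large tables. The retrieved column names and cell values would then be placed in the prompt instead of the full table.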