OpenTab: Advancing Large Language Models as Open-domain Table Reasoners

Kezhi Kong, Jiani Zhang, Zhengyuan Shen, Balasubramaniam Srinivasan, Chuan Lei, Christos Faloutsos, Huzefa Rangwala, George Karypis
2024-02-22
Abstract: Large Language Models (LLMs) trained on large volumes of data excel at various natural language tasks, but they cannot handle tasks requiring knowledge that has not been trained on previously. One solution is to use a retriever that fetches relevant information to expand LLM's knowledge scope. However, existing textual-oriented retrieval-based LLMs are not ideal on structured table data due to diversified data modalities and large table sizes. In this work, we propose OpenTab, an open-domain table reasoning framework powered by LLMs. Overall, OpenTab leverages table retriever to fetch relevant tables and then generates SQL programs to parse the retrieved tables efficiently. Utilizing the intermediate data derived from the SQL executions, it conducts grounded inference to produce accurate response. Extensive experimental evaluation shows that OpenTab significantly outperforms baselines in both open- and closed-domain settings, achieving up to 21.5% higher accuracy. We further run ablation studies to validate the efficacy of our proposed designs of the system.
Machine Learning
What problem does this paper attempt to address?
This paper addresses the challenges that large language models (LLMs) face when processing structured tabular data. Specifically, existing retrieval-based LLMs run into the following problems with tables:

1. **Diverse data modalities and large tables**: Structured tables mix multiple data types, and large or high-precision numerical values in particular drive up token usage, straining the model's memory and compute.
2. **Complex table-relation understanding**: LLMs are optimized mainly for natural-language understanding and struggle to parse the complex relationships in tables well enough to transform the data and extract answers.
3. **Limited maximum context length**: The context-length limit of LLMs makes large tables hard to handle, especially tables with millions of rows.

To address these problems, the authors propose OpenTab, a framework for table reasoning in the open-domain setting. Its main goals are:

- **Automatically identify and retrieve relevant tables**: Retrieve the tables relevant to a natural-language query from a large table corpus.
- **Generate SQL programs**: Parse the retrieved tables efficiently by generating high-quality SQL queries.
- **Reason based on intermediate data**: Use the intermediate data returned by SQL execution to perform grounded reasoning and produce accurate answers.

In addition, OpenTab introduces two key strategies to improve performance:

- **Generative Reranking & Sequential Reasoning (GRSR)**: Generate SQL queries and re-rank the retrieved tables according to query similarity, which mitigates LLM hallucination and improves prediction accuracy.
- **Simple-to-complex prompting strategy**: Generate SQL queries progressively, from simple to complex, which explores a wider range of solutions and makes the system more robust.

With these methods, OpenTab significantly outperforms the baseline methods in both open-domain and closed-domain settings, especially on large-scale tabular data.
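
The retrieve, generate SQL, execute flow described above can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: the corpus and its values are made up, `retrieve` uses plain token overlap as a stand-in for a real table retriever, `generate_sql_candidates` is a hypothetical hard-coded stub standing in for LLM calls, and falling back to the most complex query that executes is an assumption about how simple-to-complex candidates might be used. GRSR reranking and the final LLM reasoning step are omitted.

```python
import sqlite3

# Toy corpus: table name -> (column names, rows). All values are invented for illustration.
CORPUS = {
    "medals": (["country", "gold"], [("Norway", 16), ("Germany", 12), ("China", 9)]),
    "cities": (["city", "population"], [("Oslo", 700000), ("Berlin", 3600000)]),
}


def retrieve(question, corpus, k=1):
    """Rank tables by token overlap with the question (stand-in for a real retriever)."""
    q_tokens = set(question.lower().replace("?", "").split())

    def score(name):
        columns, _ = corpus[name]
        return len(q_tokens & ({name} | set(columns)))

    return sorted(corpus, key=score, reverse=True)[:k]


def generate_sql_candidates(question, table):
    """Hypothetical stub for the LLM: hard-coded SQL ordered from simple to complex."""
    return [
        f"SELECT country, gold FROM {table}",                        # simple: select columns
        f"SELECT country, gold FROM {table} WHERE gold > 10",        # intermediate: filter rows
        f"SELECT country FROM {table} ORDER BY gold DESC LIMIT 1",   # complex: full query
    ]


def load_table(conn, name, columns, rows):
    """Create the table in SQLite and insert its rows."""
    conn.execute(f"CREATE TABLE {name} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(f"INSERT INTO {name} VALUES ({placeholders})", rows)


def run_most_complex_valid(conn, candidates):
    """Try candidates from most to least complex; return the first one that executes."""
    for sql in reversed(candidates):
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # invalid SQL: fall back to a simpler candidate
    return None, []


def answer(question, corpus=CORPUS):
    """Retrieve a table, run SQL on it, and return the intermediate result."""
    table = retrieve(question, corpus)[0]
    columns, rows = corpus[table]
    conn = sqlite3.connect(":memory:")
    load_table(conn, table, columns, rows)
    sql, result = run_most_complex_valid(conn, generate_sql_candidates(question, table))
    conn.close()
    # In a full system, an LLM would read `result` together with the question.
    return table, sql, result


if __name__ == "__main__":
    print(answer("Which country won the most gold medals?"))
```

The point of the sketch is that only the small SQL result, not the whole table, would need to reach the LLM, which is how this kind of pipeline sidesteps the token-usage and context-length problems listed earlier.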