Abstract: In many use-cases, information is stored in text but not available in structured data. However, extracting data from natural language text to precisely fit a schema, and thus enable querying, is a challenging task. With the rise of pre-trained Large Language Models (LLMs), there is now an effective solution to store and use information extracted from massive corpora of text documents. Thus, we envision the use of SQL queries to cover a broad range of data that is not captured by traditional databases by tapping the information in LLMs. To ground this vision, we present Galois, a prototype based on a traditional database architecture, but with new physical operators for querying the underlying LLM. The main idea is to execute some operators of the query plan with prompts that retrieve data from the LLM. For a large class of SQL queries, querying LLMs returns well-structured relations, with encouraging qualitative results. Preliminary experimental results make pre-trained LLMs a promising addition to the field of database systems, introducing a new direction for hybrid query processing. However, we pinpoint several research challenges that must be addressed to build a DBMS that exploits LLMs. While some of these challenges necessitate integrating concepts from the NLP literature, others offer novel research avenues for the DB community.

What problem does this paper attempt to address?

The main problem this paper addresses is how to execute SQL queries against pre-trained large language models (LLMs), so that structured information can be obtained from knowledge that originated in unstructured text. Traditionally, SQL can only be executed on structured datasets with clear schemas, while a large amount of information exists only as unstructured text and cannot be queried directly with SQL. Pre-trained LLMs, however, store and can reproduce information extracted from very large text corpora. The authors therefore propose accessing the information stored in LLMs through SQL queries, which both widens the range of data SQL can reach and offers a more precise and more expressive interface than natural language prompts.

Specifically, the paper identifies the following challenges and solutions:

1. **Data Extraction and Schema Matching**: LLMs have no notion of a traditional database schema, so it is a challenge to map a SQL query onto an LLM and ensure that the returned data conforms to the expected schema. With the prototype system Galois, the authors show how a logical query plan decomposes a complex task into simple steps, each of which the LLM can handle effectively.
2. **Data Accuracy and Completeness**: Although LLMs can store high-quality factual information, they may also produce wrong or incomplete answers. Galois relies on a series of intermediate reasoning steps (such as "chain-of-thought" prompting and problem decomposition) to improve the LLM's handling of complex tasks, with the aim of more accurate and more complete query results.
3. **Architecture Design**: The paper explores two possible architectures, LLM-first and DB-first. The authors choose the DB-first approach, which uses the LLM as one component of a traditional database query processing architecture. This approach is better suited to complex operations such as aggregate queries that require a large number of input tuples.

With these methods, the paper shows that querying pre-trained LLMs through SQL is feasible, and it opens a new direction for future research and applications.
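The DB-first idea described above can be illustrated with a minimal sketch. It is not the Galois implementation: the operator names (`llm_scan`, `llm_fetch`), the prompt wording, and the `fake_llm` stub with its made-up cities and populations are all assumptions chosen so the example runs offline. The leaf operators obtain tuples by prompting the model, while selection and aggregation stay in ordinary database-side code.

```python
from typing import Callable, Dict, Iterable, Iterator

# Hypothetical stand-in for an LLM call: prompt string in, text answer out.
LLM = Callable[[str], str]
Row = Dict[str, object]


def fake_llm(prompt: str) -> str:
    """Canned answers (invented placeholder data) so the sketch runs offline."""
    answers = {
        "List the names of city, separated by commas.": "Alpha, Beta, Gamma",
        "What is the population of the city Alpha? Answer with a number only.": "2800000",
        "What is the population of the city Beta? Answer with a number only.": "2100000",
        "What is the population of the city Gamma? Answer with a number only.": "700000",
    }
    return answers[prompt]


def llm_scan(llm: LLM, relation: str, key: str) -> Iterator[Row]:
    """Leaf operator: prompt for the key values of a relation, one tuple each."""
    reply = llm(f"List the {key}s of {relation}, separated by commas.")
    for value in reply.split(","):
        yield {key: value.strip()}


def llm_fetch(llm: LLM, rows: Iterable[Row], relation: str,
              key: str, attr: str) -> Iterator[Row]:
    """Per-tuple operator: prompt for one numeric attribute of each tuple."""
    for row in rows:
        reply = llm(
            f"What is the {attr} of the {relation} {row[key]}? "
            "Answer with a number only."
        )
        yield {**row, attr: int(reply)}


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Conventional selection, evaluated outside the LLM."""
    return (row for row in rows if predicate(row))


def aggregate_sum(rows: Iterable[Row], attr: str) -> int:
    """Conventional aggregation, evaluated outside the LLM."""
    return sum(row[attr] for row in rows)


def run_query(llm: LLM) -> int:
    """Plan for: SELECT SUM(population) FROM city WHERE population > 1000000."""
    plan = llm_scan(llm, "city", "name")
    plan = llm_fetch(llm, plan, "city", "name", "population")
    plan = select(plan, lambda row: row["population"] > 1_000_000)
    return aggregate_sum(plan, "population")


if __name__ == "__main__":
    print(run_query(fake_llm))
```

Keeping the filter and the sum outside the model reflects the summary's point that aggregates over many tuples fit the DB-first design better; a real system would also need to validate and clean the model's replies before casting them to the schema's types.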

Querying Large Language Models with SQL

Evaluating SQL Understanding in Large Language Models

A Survey on Employing Large Language Models for Text-to-SQL Tasks

A Survey of NL2SQL with Large Language Models: Where are we, and where are we going?

Blar-SQL: Faster, Stronger, Smaller NL2SQL

Lucy: Think and Reason to Solve Text-to-SQL

Large Language Model Enhanced Text-to-SQL Generation: A Survey

DB-GPT: Large Language Model Meets Database

Towards Evaluating Large Language Models for Graph Query Generation

SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL (extended)

Hybrid Querying Over Relational Databases and Large Language Models

Relational Database Augmented Large Language Model

Large Language Models and Knowledge Graphs: Opportunities and Challenges

Exploring the Use of LLMs for SQL Equivalence Checking

A Survey of Large Language Model-Based Generative AI for Text-to-SQL: Benchmarks, Applications, Use Cases, and Challenges

Analyzing the Effectiveness of Large Language Models on Text-to-SQL Synthesis

Large Language Model for Table Processing: A Survey

Making LLMs Work for Enterprise Data Tasks

Can LLMs substitute SQL? Comparing Resource Utilization of Querying LLMs versus Traditional Relational Databases

SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL