Querying Large Language Models with SQL

Mohammed Saeed, Nicola De Cao, Paolo Papotti
2023-10-25
Abstract: In many use-cases, information is stored in text but not available in structured data. However, extracting data from natural language text to precisely fit a schema, and thus enable querying, is a challenging task. With the rise of pre-trained Large Language Models (LLMs), there is now an effective solution to store and use information extracted from massive corpora of text documents. Thus, we envision the use of SQL queries to cover a broad range of data that is not captured by traditional databases by tapping the information in LLMs. To ground this vision, we present Galois, a prototype based on a traditional database architecture, but with new physical operators for querying the underlying LLM. The main idea is to execute some operators of the query plan with prompts that retrieve data from the LLM. For a large class of SQL queries, querying LLMs returns well structured relations, with encouraging qualitative results. Preliminary experimental results make pre-trained LLMs a promising addition to the field of database systems, introducing a new direction for hybrid query processing. However, we pinpoint several research challenges that must be addressed to build a DBMS that exploits LLMs. While some of these challenges necessitate integrating concepts from the NLP literature, others offer novel research avenues for the DB community.
Databases, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is how to execute SQL queries against pre-trained large language models (LLMs), so that structured information can be obtained from knowledge that exists only as unstructured text. Traditionally, SQL runs only on structured datasets with explicit schemas, while a large amount of information exists as unstructured text and cannot be queried directly with SQL. Pre-trained LLMs, however, store and can use information extracted from very large collections of text documents. The authors therefore propose accessing the information stored in LLMs through SQL queries. This widens the range of data SQL can reach, and it offers a more precise and more expressive way to access that data than natural-language prompts.

Specifically, the paper identifies the following challenges and solutions:

1. **Data Extraction and Schema Matching**: LLMs have no notion of a traditional database schema, so the challenge is to map a SQL query onto the LLM and ensure that the returned data conforms to the expected schema. With their prototype system, Galois, the authors show how the logical query plan decomposes a complex task into simple steps, each of which the LLM can process effectively.
2. **Data Accuracy and Completeness**: Although LLMs can store high-quality factual information, they may also produce wrong or incomplete answers. Galois relies on a series of intermediate reasoning steps (such as "chain-of-thought" prompting and problem decomposition) to improve how LLMs handle complex tasks, with the aim of making query results more accurate and complete.
3. **Architecture Design**: The paper explores two possible architectures, LLM-first and DB-first. The authors choose the DB-first approach, in which the LLM is one component of a traditional database query processing architecture. This approach is better suited to complex operations, such as aggregate queries, that take a large number of tuples as input.

Through these methods, the paper shows that querying pre-trained LLMs with SQL is feasible, and it points to a new direction for future research and applications.
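The DB-first idea described above can be illustrated with a minimal Python sketch. It is not the Galois implementation: the operator names (`llm_scan`, `llm_filter`), the prompt wording, and the `fake_llm` stand-in with its canned answers are all assumptions made for illustration. The sketch only shows the general shape: leaf and selection operators issue prompts to the model, while the aggregate is computed by ordinary code on the tuples that come back.

```python
from typing import Callable, List

# Hypothetical stand-in for a model: any function from prompt text to answer text.
LLM = Callable[[str], str]


def llm_scan(llm: LLM, relation: str, key: str) -> List[str]:
    """Leaf operator (assumed design): ask the model for the key values of a relation."""
    answer = llm(f"List the {key} of every {relation}, separated by commas.")
    return [value.strip() for value in answer.split(",") if value.strip()]


def llm_filter(llm: LLM, relation: str, keys: List[str], condition: str) -> List[str]:
    """Selection operator (assumed design): one yes/no prompt per candidate tuple."""
    kept = []
    for k in keys:
        answer = llm(f"For the {relation} {k}, is it true that {condition}? Answer yes or no.")
        if answer.strip().lower().startswith("yes"):
            kept.append(k)
    return kept


def run_count_query(llm: LLM) -> int:
    """Plan for: SELECT COUNT(*) FROM city WHERE country = 'Italy'.

    Scan and filter go to the model; COUNT is done by the engine itself.
    """
    keys = llm_scan(llm, "city", "name")
    matching = llm_filter(llm, "city", keys, "country = 'Italy'")
    return len(matching)


def fake_llm(prompt: str) -> str:
    """Canned answers so the sketch runs without a real model (illustrative only)."""
    if prompt.startswith("List the"):
        return "Rome, Paris, Milan"
    # Yes/no prompts: only the two Italian cities in the canned list qualify.
    return "yes" if any(name in prompt for name in ("Rome", "Milan")) else "no"
```

Calling `run_count_query(fake_llm)` walks the plan bottom-up. Swapping `fake_llm` for a function that calls a real model would leave the operators unchanged, which is the main appeal of keeping the LLM behind operator boundaries in this kind of design.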