How Large Language Models Will Disrupt Data Management

Raul Castro Fernandez, Aaron J. Elmore, Michael J. Franklin, Sanjay Krishnan, Chenhao Tan
DOI: https://doi.org/10.14778/3611479.3611527
IF: 2.5
2023-07-01
Proceedings of the VLDB Endowment
Abstract: Large language models (LLMs), such as GPT-4, are revolutionizing software's ability to understand, process, and synthesize language. The authors of this paper believe that this advance in technology is significant enough to prompt introspection in the data management community, similar to previous technological disruptions such as the advents of the world wide web, cloud computing, and statistical machine learning. We argue that the disruptive influence that LLMs will have on data management will come from two angles. (1) A number of hard database problems, namely, entity resolution, schema matching, data discovery, and query synthesis, hit a ceiling of automation because the system does not fully understand the semantics of the underlying data. Based on large training corpora of natural language, structured data, and code, LLMs have an unprecedented ability to ground database tuples, schemas, and queries in real-world concepts. We will provide examples of how LLMs may completely change our approaches to these problems. (2) LLMs blur the line between predictive models and information retrieval systems with their ability to answer questions. We will present examples showing how large databases and information retrieval systems have complementary functionality.
computer science, information systems, theory & methods