UQE: A Query Engine for Unstructured Databases

Hanjun Dai, Bethany Yixin Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Phitchaya Mangpo Phothilimthana, Charles Sutton, Dale Schuurmans
2024-06-23
Abstract: Analytics on structured data is a mature field with many successful methods. However, most real world data exists in unstructured form, such as images and conversations. We investigate the potential of Large Language Models (LLMs) to enable unstructured data analytics. In particular, we propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections. This engine accepts queries in a Universal Query Language (UQL), a dialect of SQL that provides full natural language flexibility in specifying conditions and operators. The new engine leverages the ability of LLMs to conduct analysis of unstructured data, while also allowing us to exploit advances in sampling and optimization techniques to achieve efficient and accurate query execution. In addition, we borrow techniques from classical compiler theory to better orchestrate the workflow between sampling methods and foundation model calls. We demonstrate the efficiency of UQE on data analytics across different modalities, including images, dialogs and reviews, across a range of useful query types, including conditional aggregation, semantic retrieval and abstraction aggregation.
Databases, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The problem this paper addresses is how to analyze unstructured data flexibly and efficiently. Traditional SQL engines are limited when dealing with unstructured data (such as images and conversations), because the data must first be preprocessed into a structured form before it can be queried and analyzed effectively. The paper proposes a new Universal Query Engine (UQE) that extracts information directly from unstructured data collections, queried through a new language, the Universal Query Language (UQL).

### Main problems and solutions in the paper

1. **Problem description**:
   - **Limitations of traditional SQL engines**: Traditional SQL engines handle only structured data under predefined schemas and cannot directly run complex queries and analyses over unstructured data (such as images, text, and audio).
   - **Deficiencies of existing methods**: Existing methods such as full-text search engines and Support Vector Machines (SVMs) can handle simple retrieval tasks but perform poorly on queries that require more complex semantic reasoning. Retrieval-Augmented Generation (RAG) methods achieve good results on some specific tasks but are not suitable for aggregation queries and semantic queries over large-scale databases.
2. **Solutions**:
   - **Introducing UQE**: The paper proposes a new query engine, UQE, that operates directly on unstructured data and uses the capabilities of large language models (LLMs) for semantic understanding and analysis.
   - **Using UQL**: UQE accepts queries in UQL, a dialect of SQL that lets users specify conditions and operators in natural language, enabling flexible queries over unstructured data.
   - **Optimizing query execution**: To improve query efficiency, UQE borrows techniques from classical compiler theory and combines them with sampling and optimization methods so that execution stays both efficient and accurate. For example, it uses stratified sampling and online learning algorithms to reduce the number of LLM calls while preserving the semantics of the query.

### Main contributions

- **Proposing UQE**: A general-purpose query engine that handles unstructured data and supports flexible semantic queries.
- **Defining UQL**: A query language that extends SQL, allowing users to describe query conditions and operations in natural language.
- **Optimizing query execution**: Statistical sampling techniques and online learning algorithms improve query efficiency and reduce the cost of LLM calls.
- **Experimental verification**: Experiments on multiple benchmark datasets compare UQE with other methods in terms of accuracy and cost.

### Summary

This paper addresses the limitations of traditional SQL engines on unstructured data by introducing UQE and UQL, providing a flexible and efficient approach to unstructured data analysis. By combining LLMs with sampling and optimization algorithms, UQE can run complex queries over data of different modalities, including images, dialogs, and reviews.
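
The stratified sampling idea mentioned under "Optimizing query execution" can be illustrated with a small sketch. This is not the paper's algorithm: it is a textbook stratified estimator with proportional allocation, written under several assumptions. The function `stratified_count_estimate`, the dictionary `reviews_by_cluster`, and the mock predicate `mentions_refund` are all hypothetical names, the strata are assumed to be given in advance (for instance from clustering), and the predicate stands in for an expensive LLM call.

```python
import random
from typing import Callable, Dict, Sequence


def stratified_count_estimate(
    strata: Dict[str, Sequence[str]],
    predicate: Callable[[str], bool],
    budget: int,
    seed: int = 0,
) -> float:
    """Estimate how many rows satisfy `predicate` using roughly `budget` calls.

    Each stratum receives a share of the budget proportional to its size
    (at least one sample per non-empty stratum), and the hit rate observed
    in the sample is scaled up to the size of the stratum.
    """
    rng = random.Random(seed)
    total = sum(len(rows) for rows in strata.values())
    estimate = 0.0
    for rows in strata.values():
        if not rows:
            continue
        # Proportional allocation, clamped to [1, len(rows)].
        n = max(1, min(len(rows), round(budget * len(rows) / total)))
        sample = rng.sample(list(rows), n)
        hits = sum(predicate(row) for row in sample)
        estimate += len(rows) * hits / n
    return estimate


# Hypothetical data: two precomputed strata of review texts.
reviews_by_cluster = {
    "complaints": ["please send a refund"] * 40,
    "praise": ["works great"] * 160,
}

llm_calls = 0


def mentions_refund(text: str) -> bool:
    """Mock predicate; a real system would issue an LLM call here."""
    global llm_calls
    llm_calls += 1
    return "refund" in text


estimate = stratified_count_estimate(reviews_by_cluster, mentions_refund, budget=20)
print(f"estimated matches: {estimate:.1f} using {llm_calls} predicate calls")
```

In this toy data each stratum is homogeneous, so the estimate matches the true count while evaluating the predicate on 20 rows instead of all 200. With real data the estimate carries sampling error, and its quality depends on how well the strata separate matching rows from non-matching ones.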