GenSQL: A Probabilistic Programming System for Querying Generative Models of Database Tables

Mathieu Huot, Matin Ghavami, Alexander K. Lew, Ulrich Schaechtle, Cameron E. Freer, Zane Shelby, Martin C. Rinard, Feras A. Saad, Vikash K. Mansinghka
DOI: https://doi.org/10.1145/3656409
2024-06-22
Abstract: This article presents GenSQL, a probabilistic programming system for querying probabilistic generative models of database tables. By augmenting SQL with only a few key primitives for querying probabilistic models, GenSQL enables complex Bayesian inference workflows to be concisely implemented. GenSQL's query planner rests on a unified programmatic interface for interacting with probabilistic models of tabular data, which makes it possible to use models written in a variety of probabilistic programming languages that are tailored to specific workflows. Probabilistic models may be automatically learned via probabilistic program synthesis, hand-designed, or a combination of both. GenSQL is formalized using a novel type system and denotational semantics, which together enable us to establish proofs that precisely characterize its soundness guarantees. We evaluate our system on two real-world case studies -- anomaly detection in clinical trials and conditional synthetic data generation for a virtual wet lab -- and show that GenSQL more accurately captures the complexity of the data as compared to common baselines. We also show that the declarative syntax in GenSQL is more concise and less error-prone as compared to several alternatives. Finally, GenSQL delivers a 1.7-6.8x speedup compared to its closest competitor on a representative benchmark set and runs in comparable time to hand-written code, in part due to its reusable optimizations and code specialization.
Programming Languages
What problem does this paper attempt to address?
### What problem does this paper attempt to solve?

This paper addresses how to efficiently query generative models of the data in database tables. Specifically, the authors propose a system named GenSQL, which extends SQL so that queries can integrate probabilistic generative models. In this way, users can concisely express complex Bayesian inference tasks, such as:

1. **Synthetic data generation**: Generate new data records according to user-defined constraints.
2. **Conditional distribution inference**: Condition the probabilistic model on observed data.
3. **Database operations**: Combine standard SQL queries with the results of probabilistic models to perform aggregation, filtering, etc.

#### Main challenges

- **Limitations of existing systems**: Most existing probabilistic programming systems only support specifying generative models and estimating parameters; they do not support combining these models with complex database queries.
- **User-friendliness**: Users need a concise, easy-to-use interface to perform complex Bayesian inference tasks in the database.
- **Performance optimization**: Queries must run efficiently and process large-scale data quickly in practical applications.

#### Key features of GenSQL

1. **Extended SQL**: A few key syntactic constructs, such as `GENERATE UNDER`, `GIVEN`, `GENERATIVE JOIN` and `PROBABILITY OF`, let SQL queries work with probabilistic models.
2. **Abstract Model Interface (AMI)**: A unified interface makes different probabilistic models compatible with GenSQL, so models can be written in multiple probabilistic programming languages.
3. **Formal guarantees**: A type system and denotational semantics support proofs that characterize the soundness of query results, including guarantees for both exact and approximate inference.
4. **Open-source implementation**: Provides implementations of multiple probabilistic models and demonstrates its performance advantages in practical applications.

### Summary

GenSQL addresses the shortcomings of existing systems in handling complex Bayesian inference tasks by extending SQL so that users can query probabilistic generative models directly in database queries. According to the abstract, the resulting queries are more concise and less error-prone than several alternatives, capture the data more accurately than common baselines, and run 1.7-6.8x faster than the closest competitor on a representative benchmark set.
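
To make the idea of a unified model interface more concrete, here is a minimal Python sketch of what querying a conditioned model row by row might look like. It is not GenSQL's actual API or query syntax: the names `JointTable`, `prob`, `condition`, `simulate`, and `probability_of` are hypothetical, and the toy model is an explicit joint probability table rather than a learned probabilistic program. The sketch only illustrates the pattern of conditioning a model on each row's observed columns and then evaluating the probability of an event.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


# All names below are hypothetical illustrations, not GenSQL's real interface.
@dataclass(frozen=True)
class JointTable:
    """Toy generative model: an explicit joint distribution over full rows."""

    columns: tuple[str, ...]
    probs: dict[tuple, float]  # maps a full row (as a tuple) to its probability

    def _matches(self, row: tuple, constraints: dict) -> bool:
        # True if the row agrees with every (column, value) constraint.
        return all(row[self.columns.index(c)] == v for c, v in constraints.items())

    def prob(self, event: dict) -> float:
        """Marginal probability that the named columns take the given values."""
        return sum(p for row, p in self.probs.items() if self._matches(row, event))

    def condition(self, given: dict) -> JointTable:
        """Return a new model conditioned on the observed column values."""
        z = self.prob(given)
        if z == 0:
            raise ValueError("cannot condition on a zero-probability event")
        kept = {r: p / z for r, p in self.probs.items() if self._matches(r, given)}
        return JointTable(self.columns, kept)

    def simulate(self, rng: random.Random) -> dict:
        """Draw one synthetic row from the model."""
        rows = list(self.probs)
        row = rng.choices(rows, weights=[self.probs[r] for r in rows], k=1)[0]
        return dict(zip(self.columns, row))


def probability_of(model: JointTable, event: dict, given: tuple, table: list) -> list:
    """For each table row, evaluate P(event | that row's values for `given`)."""
    out = []
    for r in table:
        observed = {c: r[c] for c in given}
        out.append(model.condition(observed).prob(event))
    return out


# Made-up example data: a joint distribution over two columns.
model = JointTable(
    columns=("smoker", "disease"),
    probs={
        ("yes", "yes"): 0.2,
        ("yes", "no"): 0.2,
        ("no", "yes"): 0.1,
        ("no", "no"): 0.5,
    },
)
table = [{"id": 1, "smoker": "yes"}, {"id": 2, "smoker": "no"}]
result = probability_of(model, {"disease": "yes"}, ("smoker",), table)
synthetic = model.condition({"smoker": "yes"}).simulate(random.Random(0))
```

With this made-up distribution, `result` holds the conditional probability of `disease = "yes"` for each row (0.5 for the smoker, about 0.167 for the non-smoker), and `synthetic` is a generated row that always satisfies the `smoker = "yes"` constraint. Separating the model operations from the per-row query logic mirrors, in a very simplified way, the division of labor the summary describes between the model interface and the SQL extensions.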