Abstract: Probabilistic relational models provide a well-established formalism to combine first-order logic and probabilistic models, thereby allowing relationships between objects in a relational domain to be represented. At the same time, the field of artificial intelligence requires increasingly large amounts of relational training data for various machine learning tasks. Collecting real-world data, however, is often challenging due to privacy concerns, data protection regulations, high costs, and so on. To mitigate these challenges, the generation of synthetic data is a promising approach. In this paper, we solve the problem of generating synthetic relational data via probabilistic relational models. In particular, we propose a fully-fledged pipeline to go from relational database to probabilistic relational model, which can then be used to sample new synthetic relational data points from its underlying probability distribution. As part of our proposed pipeline, we introduce a learning algorithm to construct a probabilistic relational model from a given relational database.

What problem does this paper attempt to address?

The core problem this paper addresses is generating synthetic relational data with Probabilistic Relational Models (PRMs), in order to sidestep the privacy concerns, data protection regulations, and high costs involved in collecting real-world relational data. Specifically, the authors propose a complete pipeline from a relational database to a probabilistic relational model, from which new synthetic relational data points can be sampled according to the underlying probability distribution.

### Detailed Explanation

1. **Background and Motivation**
   - **Privacy Issues**: In many application scenarios, such as medical records and financial transactions, directly using real data may disclose sensitive information and violate privacy regulations.
   - **Data Requirements**: Machine learning tasks require large amounts of training data, but collecting real data is often difficult.
   - **Advantages of Synthetic Data**: Synthetic data can provide sufficient training samples without violating personal privacy while preserving the realism and complexity of the original data.
2. **Method Overview**
   - **Probabilistic Relational Models (PRMs)**: By combining first-order logic and probabilistic models, PRMs represent the relationships between objects and compactly encode the joint probability distribution.
   - **Process from Relational Database to PRM**:
     1. **Constructing a Propositional Factor Graph (FG)**: First, learn a propositional factor graph from the given relational database. This step identifies the variable nodes, the factor nodes, and the edges between them.
     2. **Converting to a Parameterized Factor Graph (PFG)**: Use the Advanced Color Passing (ACP) algorithm to convert the propositional factor graph into a parameterized factor graph, which abstracts over groups of indistinguishable objects.
     3. **Sampling to Generate Synthetic Data**: Sample new synthetic data points from the PFG. These data points follow the underlying joint probability distribution.
3. **Technical Innovation Points**
   - **Handling Individual Objects and Relationships**: Traditional factor graph learning algorithms usually ignore the relational structure of the data, whereas the pipeline proposed in this paper retains information about objects and their relationships.
   - **Privacy Protection**: By grouping indistinguishable objects, the PFG naturally provides a basis for differential privacy, reducing the dependence on individual data.
   - **Interpretability**: The PFG can be used not only for generating synthetic data but also for probabilistic and causal reasoning, and it is highly interpretable.
4. **Application Prospects**
   - **Training Machine Learning Models**: The generated synthetic data can be used to train various machine learning models, especially in fields with strict privacy requirements.
   - **Data Sharing**: Synthetic data can be shared safely without violating privacy.
   - **Guiding PFG Learning**: Existing databases can be used to generate more synthetic data, which in turn can further improve the PFG learning process.

### Formula Examples

The formulas involved in the paper are as follows:

- **Semantics of Factor Graph**:
  \[ P_G=\frac{1}{Z}\prod_{j = 1}^{m}\phi_j(A_j) \]
  where \(Z\) is the normalization constant and \(A_j\) denotes the random variables connected to the factor node \(\phi_j\).
- **Semantics of Parameterized Factor Graph**:
  \[ P_G=\frac{1}{Z}\prod_{g_j\in G}\prod_{\phi_k\in \text{gr}(g_j)}\phi_k(A_k) \]
  where \(\text{gr}(g_j)\) denotes the set of ground factors obtained by grounding the parameterized factor \(g_j\), and \(Z\) and \(A_k\) are defined as above.

Through these formulas, the paper shows how to learn from relational databases and generate synthetic data that conforms to the underlying probability distribution, thus reconciling privacy protection with the demand for data.
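
To make the factor graph semantics above concrete, the following minimal Python sketch evaluates \(P_G\) by brute-force enumeration on a toy model and then samples from it. Everything in it is an illustrative assumption rather than the paper's implementation: the variable names `A`, `B`, `C`, the potential values, and the helper names `unnormalized`, `joint_distribution`, and `sample` are all hypothetical. The two ground factors share one potential table, mimicking how the groundings of a single parameterized factor share a potential. Enumeration is only feasible for tiny models.

```python
import itertools
import random

# Hypothetical Boolean random variables (not taken from the paper).
VARIABLES = ["A", "B", "C"]

# One potential table shared by all groundings of a hypothetical
# parameterized factor: it favors equal values of its two arguments.
SHARED_TABLE = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

# Ground factors phi_k with their scopes A_k; both reuse SHARED_TABLE.
GROUND_FACTORS = [
    (("A", "B"), SHARED_TABLE),
    (("B", "C"), SHARED_TABLE),
]


def unnormalized(assignment):
    """Product of all ground factor potentials for one full assignment."""
    weight = 1.0
    for scope, table in GROUND_FACTORS:
        weight *= table[tuple(assignment[v] for v in scope)]
    return weight


def joint_distribution():
    """Enumerate all assignments; return (states, probabilities, Z)."""
    states = [
        dict(zip(VARIABLES, values))
        for values in itertools.product((0, 1), repeat=len(VARIABLES))
    ]
    weights = [unnormalized(s) for s in states]
    z = sum(weights)  # normalization constant Z
    return states, [w / z for w in weights], z


def sample(n, seed=0):
    """Draw n synthetic data points from the joint distribution P_G."""
    rng = random.Random(seed)
    states, probs, _ = joint_distribution()
    return rng.choices(states, weights=probs, k=n)


if __name__ == "__main__":
    states, probs, z = joint_distribution()
    print("Z =", z)
    for row in sample(3):
        print(row)
```

Each sampled dictionary plays the role of one synthetic data point. A realistic model would need lifted or approximate inference instead of full enumeration, since the number of assignments grows exponentially with the number of variables.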

Towards Privacy-Preserving Relational Data Synthesis via Probabilistic Relational Models

Adapting Differentially Private Synthetic Data to Relational Databases

PrivSyn: Differentially Private Data Synthesis

PrivLava: Synthesizing Relational Data with Foreign Keys under Differential Privacy

Relational data synthesis using generative adversarial networks

Relational Data Synthesis using Generative Adversarial Networks: A Design Space Exploration

Compiling Relational Database Schemata into Probabilistic Graphical Models

A Method for Implementing a Probabilistic Model as a Relational Database

Bayesian Synthesis of Probabilistic Programs for Automatic Data Modeling

Structure Learning of Probabilistic Relational Models from Incomplete Relational Data

Differentially Private Synthetic Data: Applied Evaluations and Enhancements

Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data

Differentially Private Synthetic Data Generation via Lipschitz-Regularised Variational Autoencoders

Privacy-Preserving Synthetic Data Generation for Recommendation Systems

Tabular Data Synthesis with Differential Privacy: A Survey

On the Challenges of Deploying Privacy-Preserving Synthetic Data in the Enterprise

SoK: Privacy-Preserving Data Synthesis

Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data

Synthetic Query Generation for Privacy-Preserving Deep Retrieval Systems using Differentially Private Language Models

Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data

GFS: Graph-based Feature Synthesis for Prediction over Relational Databases