Towards Privacy-Preserving Relational Data Synthesis via Probabilistic Relational Models

Malte Luttermann, Ralf Möller, Mattis Hartwig
2024-10-03
Abstract: Probabilistic relational models provide a well-established formalism to combine first-order logic and probabilistic models, thereby allowing to represent relationships between objects in a relational domain. At the same time, the field of artificial intelligence requires increasingly large amounts of relational training data for various machine learning tasks. Collecting real-world data, however, is often challenging due to privacy concerns, data protection regulations, high costs, and so on. To mitigate these challenges, the generation of synthetic data is a promising approach. In this paper, we solve the problem of generating synthetic relational data via probabilistic relational models. In particular, we propose a fully-fledged pipeline to go from relational database to probabilistic relational model, which can then be used to sample new synthetic relational data points from its underlying probability distribution. As part of our proposed pipeline, we introduce a learning algorithm to construct a probabilistic relational model from a given relational database.
Artificial Intelligence, Databases, Machine Learning
What problem does this paper attempt to address?
The core problem this paper addresses is how to generate synthetic relational data with Probabilistic Relational Models (PRMs), in order to mitigate the privacy concerns, data protection regulations, and high costs that come with collecting real-world relational data. Specifically, the authors propose a complete pipeline from a relational database to a probabilistic relational model, from whose underlying probability distribution new synthetic relational data points can then be sampled.

### Detailed Explanation

1. **Background and Motivation**
   - **Privacy Issues**: In many application scenarios, such as medical records and financial transactions, directly using real data may disclose sensitive information and violate privacy regulations.
   - **Data Requirements**: Machine-learning tasks require large amounts of training data, but collecting real data is often difficult.
   - **Advantages of Synthetic Data**: Synthetic data can provide sufficient training samples without exposing personal records, while preserving the statistical characteristics and complexity of the real data.

2. **Method Overview**
   - **Probabilistic Relational Models (PRMs)**: By combining first-order logic with probabilistic models, PRMs represent the relationships between objects and compactly encode the joint probability distribution.
   - **Process from Relational Database to PRM**:
     1. **Constructing a Propositional Factor Graph (FG)**: First, learn a propositional factor graph from the given relational database. This step identifies the variable nodes, the factor nodes, and the edges between them.
     2. **Converting to a Parameterized Factor Graph (PFG)**: Use the Advanced Color Passing (ACP) algorithm to convert the propositional factor graph into a parameterized factor graph, which abstracts over groups of indistinguishable objects.
     3. **Sampling to Generate Synthetic Data**: Sample new synthetic data points from the PFG. These data points follow the underlying joint probability distribution.

3. **Technical Innovation Points**
   - **Handling Individual Objects and Relationships**: Traditional factor graph learning algorithms usually ignore the relational structure of the data, whereas the pipeline proposed in this paper retains the information about objects and their relationships.
   - **Privacy Protection**: Because the PFG groups indistinguishable objects, it reduces the dependence on individual records and provides a natural basis for differential privacy.
   - **Interpretability**: The PFG can be used not only to generate synthetic data but also for probabilistic and causal reasoning, which makes it highly interpretable.

4. **Application Prospects**
   - **Training Machine-Learning Models**: The generated synthetic data can be used to train various machine-learning models, especially in fields with strict privacy requirements.
   - **Data Sharing**: Synthetic data can be shared without disclosing the original records.
   - **Guiding PFG Learning**: Synthetic data generated from existing databases can be used to further improve the PFG learning process.

### Formula Examples

The formulas involved in the paper are as follows:

- **Semantics of Factor Graph**:
  \[ P_G=\frac{1}{Z}\prod_{j = 1}^{m}\phi_j(A_j) \]
  where \(Z\) is the normalization constant, \(m\) is the number of factors, and \(A_j\) denotes the random variables connected to the factor node \(\phi_j\).
- **Semantics of Parameterized Factor Graph**:
  \[ P_G=\frac{1}{Z}\prod_{g_j\in G}\prod_{\phi_k\in \text{gr}(g_j)}\phi_k(A_k) \]
  where \(\text{gr}(g_j)\) denotes the set of ground factors obtained by instantiating the parameterized factor \(g_j\) for the objects it covers.

Through these formulas, the paper shows how to learn a model from a relational database and generate synthetic data that conforms to the underlying probability distribution, easing the tension between privacy protection and data requirements.
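
To make the two semantics formulas and the sampling step concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical toy domain (an `Epidemic` variable and one `Sick(X)` variable per person) with made-up potential values. A single parameterized factor is grounded into one factor per object, the joint distribution is obtained as the normalized product of the ground factors, and synthetic rows are drawn from it. The brute-force enumeration is an assumption chosen for readability; it scales exponentially and only works for tiny models.

```python
import itertools
import random

# Hypothetical toy domain: one shared Boolean variable and one Boolean
# variable per person. Names and numbers are illustrative assumptions.
people = ["alice", "bob"]
variables = ["Epidemic"] + [f"Sick({p})" for p in people]

# One parameterized factor g(Epidemic, Sick(X)): a single potential table
# that is shared by all (indistinguishable) objects X.
shared_potential = {
    (True, True): 4.0,
    (True, False): 1.0,
    (False, True): 1.0,
    (False, False): 3.0,
}


def ground(table, objects):
    """Return gr(g): one ground factor (scope, table) per object."""
    return [(("Epidemic", f"Sick({obj})"), table) for obj in objects]


ground_factors = ground(shared_potential, people)


def unnormalized(assignment):
    """Product of all ground factors evaluated at a full assignment."""
    weight = 1.0
    for scope, table in ground_factors:
        weight *= table[tuple(assignment[v] for v in scope)]
    return weight


# Enumerate every full assignment to obtain the normalization constant Z.
assignments = [
    dict(zip(variables, values))
    for values in itertools.product([True, False], repeat=len(variables))
]
weights = [unnormalized(a) for a in assignments]
Z = sum(weights)
joint = [w / Z for w in weights]


def sample(n, seed=0):
    """Draw n synthetic rows from the joint distribution."""
    rng = random.Random(seed)
    return rng.choices(assignments, weights=joint, k=n)


synthetic_rows = sample(5)
for row in synthetic_rows:
    print(row)
```

In this toy model the normalization constant works out to \(Z = (4+1)^2 + (1+3)^2 = 41\). Each printed row is one synthetic data point, namely a full assignment to all ground random variables. Because every person shares the same potential table, no row depends on parameters specific to one individual.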