Bayesian network Motifs for reasoning over heterogeneous unlinked datasets

Yi Sui, Alex Kwan, Alexander W. Olson, Scott Sanner, Daniel A. Silver
DOI: https://doi.org/10.1007/s10618-024-01054-7
IF: 5.406
2024-08-19
Data Mining and Knowledge Discovery
Abstract: Modern data-oriented applications often require integrating data from multiple heterogeneous sources. When these datasets share attributes, but are otherwise unlinked, there is no way to join them and reason at the individual level explicitly. However, as we show in this work, this does not prevent probabilistic reasoning over these heterogeneous datasets even when the data and shared attributes exhibit significant mismatches that are common in real-world data. Different datasets have different sample biases, disagree on category definitions and spatial representations, collect data at different temporal intervals, and mix aggregate-level with individual data. In this work, we demonstrate how a set of Bayesian network motifs allows all of these mismatches to be resolved in a composable framework that permits joint probabilistic reasoning over all datasets without manipulating, modifying, or imputing the original data, thus avoiding potentially harmful assumptions. We provide an open-source Python tool that encapsulates our methodology and demonstrate this tool on a number of real-world use cases.
computer science, information systems, artificial intelligence