Bayesian mixture modeling for multivariate conditional distributions

Maria DeYoreo, Jerome P. Reiter
DOI: https://doi.org/10.48550/arXiv.1606.04457
2016-07-14
Abstract: We present a Bayesian mixture model for estimating the joint distribution of mixed ordinal, nominal, and continuous data conditional on a set of fixed variables. The model uses multivariate normal and categorical mixture kernels for the random variables. It induces dependence between the random and fixed variables through the means of the multivariate normal mixture kernels and via a truncated local Dirichlet process. The latter encourages observations with similar values of the fixed variables to share mixture components. Using a simulation of data fusion, we illustrate that the model can estimate underlying relationships in the data and the distributions of the missing values more accurately than a mixture model applied to the random and fixed variables jointly. We use the model to analyze consumers' reading behaviors using a quota sample, i.e., a sample where the empirical distribution of some variables is fixed by design and so should not be modeled as random, conducted by the book publisher HarperCollins.
Methodology
What problem does this paper attempt to address?
This paper addresses the problem of estimating the joint distribution of mixed data (ordinal, nominal, and continuous variables) conditional on a set of fixed variables. Specifically, it proposes a Bayesian mixture model for the conditional distribution of the remaining variables (the random variables) given the fixed variables. The model is particularly suited to data collected under stratified or quota sampling designs, where the empirical distributions of certain variables are fixed by design and so should not be modeled as random.

The main contributions of the paper are as follows:

1. **Model flexibility**: The proposed Bayesian mixture model uses multivariate normal and categorical mixture kernels, so it can handle ordinal, nominal, and continuous random variables together.
2. **Conditional dependence modeling**: Dependence between the random and fixed variables is induced through the means of the multivariate normal mixture kernels and via a truncated local Dirichlet process. The latter encourages observations with similar values of the fixed variables to share mixture components.
3. **Data fusion and missing value handling**: In a simulation of data fusion, the model estimates the underlying relationships in the data and the distributions of the missing values more accurately than a mixture model applied to the random and fixed variables jointly.

The paper also applies the model to a real quota sample on consumers' reading behaviors, conducted by the book publisher HarperCollins. In this application, the researchers aim to understand individual reading behaviors and interests, for example by distinguishing the characteristics of people who own e-books from those who do not.

In conclusion, the paper aims to provide a statistical tool for estimating and analyzing the joint distribution of mixed data more accurately in the presence of fixed variables.
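To make the idea of "similar fixed-variable values share mixture components" concrete, the following is a minimal sketch of covariate-dependent mixture weights built by stick-breaking over only those components whose location lies near a given fixed-variable value. It is loosely in the spirit of a truncated local Dirichlet process and is not the paper's exact specification: the scalar fixed variable, the evenly spaced `locations`, the `radius`, the Beta(1, 1) sticks, and the function name `local_stick_breaking_weights` are all assumptions made for illustration.

```python
import numpy as np


def local_stick_breaking_weights(x, locations, sticks, radius):
    """Mixture weights for one scalar fixed-variable value x (illustrative only).

    Only components whose location lies within `radius` of x receive weight.
    Stick-breaking runs over those local components in index order, and the
    last local component takes all remaining mass so the weights sum to one.
    """
    weights = np.zeros(len(sticks))
    local = np.flatnonzero(np.abs(locations - x) < radius)
    if local.size == 0:
        raise ValueError("no component location within radius of x")
    remaining = 1.0
    for h in local[:-1]:
        weights[h] = sticks[h] * remaining
        remaining *= 1.0 - sticks[h]
    weights[local[-1]] = remaining  # truncation: final local stick is set to 1
    return weights


rng = np.random.default_rng(0)
n_components = 20                                # hypothetical truncation level
locations = np.linspace(0.0, 1.0, n_components)  # assumed component locations
sticks = rng.beta(1.0, 1.0, n_components)        # assumed stick proportions

w_a = local_stick_breaking_weights(0.30, locations, sticks, radius=0.25)
w_b = local_stick_breaking_weights(0.32, locations, sticks, radius=0.25)
w_c = local_stick_breaking_weights(0.90, locations, sticks, radius=0.25)

# Nearby values (0.30 and 0.32) put mass on overlapping sets of components,
# while a distant value (0.90) uses a disjoint set.
shared_ab = np.flatnonzero((w_a > 0) & (w_b > 0))
shared_ac = np.flatnonzero((w_a > 0) & (w_c > 0))
```

In this sketch, `shared_ab` is non-empty while `shared_ac` is empty, which is the qualitative behavior the summary describes. In a full model each component would also carry kernel parameters, with the multivariate normal means depending on the fixed variables.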