Abstract: Semi-continuous data are very common in social sciences and economics. In this paper, a Bayesian variable selection procedure is developed to assess the influence of observed and/or unobserved exogenous factors on semi-continuous data. Our formulation is based on a two-part latent variable model with polytomous responses. We consider two schemes for the penalties of regression coefficients and factor loadings: a Bayesian spike and slab bimodal prior and a Bayesian lasso prior. Within the Bayesian framework, we implement a Markov chain Monte Carlo sampling method to conduct posterior inference. To facilitate posterior sampling, we recast the logistic model from Part One as a norm-type mixture model. A Gibbs sampler is designed to draw observations from the posterior. Our empirical results show that with suitable values of hyperparameters, the spike and slab bimodal method slightly outperforms Bayesian lasso in the current analysis. Finally, a real example related to the Chinese Household Financial Survey is analyzed to illustrate the application of the methodology.
What problem does this paper attempt to address?
The paper addresses the statistical analysis of semicontinuous data, that is, data that mix a large share of exact zeros with continuous positive values, which are common in the social sciences and economics. Specifically, it proposes a Bayesian variable selection procedure to assess the influence of observed and unobserved exogenous factors on such data. To this end, the authors develop a two-part latent variable model with polytomous responses and consider two penalization schemes for the regression coefficients and factor loadings: a Bayesian spike and slab bimodal prior and a Bayesian lasso prior.
To achieve this goal, the paper employs the following key steps and techniques:
1. **Model Construction**: A two-part model is combined with a latent variable model. Part One is a logistic model for the binary outcome (e.g., whether the response is zero or not), and Part Two models the non-zero continuous responses. Latent factors, measured through polytomous responses, enter as unobserved exogenous factors, so the model handles binary, continuous, and categorical data simultaneously.
2. **Variable Selection**: Regression coefficients and factor loadings are penalized with either the spike and slab bimodal prior or the Bayesian lasso prior, so that the posterior identifies which observed and latent explanatory factors contribute to the model.
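3. **Posterior Inference**: Posterior inference is carried out by Markov chain Monte Carlo (MCMC) sampling. To facilitate sampling, the logistic model in Part One is recast as a norm-type mixture model, and a Gibbs sampler then draws from the posterior to estimate the model parameters and perform variable selection.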
4. **Model Validation**: The proposed model's effectiveness is validated through simulation studies, and its practical value is demonstrated with a real example from the Chinese Household Financial Survey.
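As a rough illustration of steps 1 and 2, a generic two-part specification with the two shrinkage priors can be written as below. The notation ($y_i$, $u_i$, $\mathbf{x}_i$, $\boldsymbol{\omega}_i$, $\mathbf{a}$, $\mathbf{b}$, $s_j$, $\tau_0$, $\tau_1$, $\pi$, $\lambda$) and the log-normal form of Part Two are assumptions made here for illustration, not the paper's exact formulation.

$$
\begin{aligned}
u_i &= I(y_i > 0), \\
\operatorname{logit} P(u_i = 1 \mid \mathbf{x}_i, \boldsymbol{\omega}_i) &= \boldsymbol{\alpha}^\top \mathbf{x}_i + \mathbf{a}^\top \boldsymbol{\omega}_i && \text{(Part One)}, \\
\log y_i \mid u_i = 1 &\sim N\!\left(\boldsymbol{\beta}^\top \mathbf{x}_i + \mathbf{b}^\top \boldsymbol{\omega}_i,\ \sigma^2\right) && \text{(Part Two)}, \\
\beta_j \mid s_j &\sim (1 - s_j)\, N(0, \tau_0^2) + s_j\, N(0, \tau_1^2), \quad s_j \sim \text{Bernoulli}(\pi) && \text{(spike and slab)}, \\
\beta_j \mid \tau_j^2 &\sim N(0, \sigma^2 \tau_j^2), \quad \tau_j^2 \sim \text{Exp}(\lambda^2 / 2) && \text{(Bayesian lasso)}.
\end{aligned}
$$

Here $\mathbf{x}_i$ collects the observed covariates and $\boldsymbol{\omega}_i$ the latent factors. With $\tau_0 \ll \tau_1$, a coefficient assigned to the spike ($s_j = 0$) is shrunk toward zero, and the posterior mean of $s_j$ can be read as an inclusion probability. In the lasso form, $\text{Exp}(\lambda^2/2)$ denotes an exponential distribution with rate $\lambda^2/2$, which makes the marginal prior on $\beta_j$ a Laplace distribution. Priors of the same form can be placed on the other coefficients and on the factor loadings.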
The results show that, with suitable hyperparameter values, the spike and slab bimodal method slightly outperforms the Bayesian lasso in the analysis reported. The advantage is small and depends on the hyperparameter choice, so it is best read as evidence that both priors are workable for semicontinuous data with many candidate exogenous factors rather than as a general ranking of the two.
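To make the role of the hyperparameters more concrete, the following is a minimal sketch of a spike and slab Gibbs sampler applied only to the continuous part of simulated two-part data. It is not the authors' sampler: it assumes a log-normal Part Two, omits the latent factors and the Part One logistic model (and hence the mixture representation used in the paper), and all names (`simulate_two_part`, `ssvs_gibbs`, `tau0`, `tau1`, `prior_incl`) are hypothetical.

```python
import numpy as np


def simulate_two_part(n, alpha, beta, sigma, rng):
    """Simulate semicontinuous y: zero with logistic probability, else log-normal."""
    X = rng.normal(size=(n, len(beta)))
    prob_nonzero = 1.0 / (1.0 + np.exp(-(X @ alpha)))   # Part One (logistic)
    u = rng.random(n) < prob_nonzero                    # indicator of y > 0
    log_y = X @ beta + sigma * rng.normal(size=n)       # Part Two (log scale)
    return X, np.where(u, np.exp(log_y), 0.0)


def ssvs_gibbs(X, y, n_iter=2000, burn_in=500, tau0=0.05, tau1=3.0,
               prior_incl=0.5, a=1.0, b=1.0, rng=None):
    """Gibbs sampler for linear regression with a two-normal spike and slab prior.

    Assumed prior: beta_j ~ N(0, tau1^2) if s_j = 1, else N(0, tau0^2);
    s_j ~ Bernoulli(prior_incl); sigma^2 ~ Inverse-Gamma(a, b).
    Returns posterior inclusion probabilities and posterior means of beta.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    s = np.ones(p, dtype=bool)          # start with every variable in the slab
    sigma2 = 1.0
    incl_sum, beta_sum = np.zeros(p), np.zeros(p)
    for it in range(n_iter):
        # 1) beta | s, sigma2: multivariate normal
        prior_var = np.where(s, tau1 ** 2, tau0 ** 2)
        cov = np.linalg.inv(XtX / sigma2 + np.diag(1.0 / prior_var))
        cov = 0.5 * (cov + cov.T)       # remove round-off asymmetry
        beta = rng.multivariate_normal(cov @ Xty / sigma2, cov)
        # 2) s_j | beta_j: Bernoulli, comparing slab and spike densities
        log_slab = np.log(prior_incl) - np.log(tau1) - 0.5 * beta ** 2 / tau1 ** 2
        log_spike = np.log1p(-prior_incl) - np.log(tau0) - 0.5 * beta ** 2 / tau0 ** 2
        prob_slab = 1.0 / (1.0 + np.exp(np.clip(log_spike - log_slab, -700, 700)))
        s = rng.random(p) < prob_slab
        # 3) sigma2 | beta: inverse-gamma, drawn as the reciprocal of a gamma
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(a + 0.5 * n, 1.0 / (b + 0.5 * resid @ resid))
        if it >= burn_in:
            incl_sum += s
            beta_sum += beta
    kept = n_iter - burn_in
    return incl_sum / kept, beta_sum / kept


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    alpha = np.array([0.5, 0.0, 0.0, -0.5, 0.0])   # Part One coefficients
    beta = np.array([1.5, 0.0, 0.0, -1.0, 0.0])    # Part Two coefficients
    X, y = simulate_two_part(600, alpha, beta, 0.5, rng)
    positive = y > 0
    incl, beta_hat = ssvs_gibbs(X[positive], np.log(y[positive]), rng=rng)
    print("share of zeros:", 1.0 - positive.mean())
    print("inclusion probabilities:", np.round(incl, 2))
    print("posterior means:", np.round(beta_hat, 2))
```

The spike scale `tau0`, the slab scale `tau1`, and the prior inclusion probability `prior_incl` play the role of the hyperparameters mentioned above: a wider spike or a narrower slab blurs the separation between selected and unselected coefficients, which is one reason results can be sensitive to their values.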