A Bayesian approach for clustering skewed data using mixtures of multivariate normal-inverse Gaussian distributions

Yuan Fang, Dimitris Karlis, Sanjeena Subedi
DOI: https://doi.org/10.48550/arXiv.2005.02585
2020-05-06
Abstract: Non-Gaussian mixture models are gaining increasing attention for mixture model-based clustering particularly when dealing with data that exhibit features such as skewness and heavy tails. Here, such a mixture distribution is presented, based on the multivariate normal inverse Gaussian (MNIG) distribution. For parameter estimation of the mixture, a Bayesian approach via Gibbs sampler is used; for this, a novel approach to simulate univariate generalized inverse Gaussian random variables and matrix generalized inverse Gaussian random matrices is provided. The proposed algorithm will be applied to both simulated and real data. Through simulation studies and real data analysis, we show parameter recovery and that our approach provides competitive clustering results compared to other clustering approaches.
Computation
What problem does this paper attempt to address?
This paper addresses how to cluster data that exhibit skewness and heavy tails. It proposes a mixture model based on the multivariate normal-inverse Gaussian (MNIG) distribution and estimates its parameters with a Bayesian approach via Gibbs sampling. A traditional Gaussian mixture model can only represent elliptically symmetric clusters, whereas the components of an MNIG mixture can be either skewed or symmetric, so the model adapts more flexibly to skewed, heavy-tailed data.

### Main Contributions

1. **Proposing a Mixture Model Based on the MNIG Distribution**: The paper uses the multivariate normal-inverse Gaussian (MNIG) distribution as the component distribution of the mixture model in order to handle skewed and heavy-tailed data.
2. **Bayesian Parameter Estimation**: Parameters are estimated with a Gibbs sampler, which is intended to avoid the slow convergence and unstable results that can affect the traditional EM algorithm.
3. **Novel Random Variable Generation Method**: New methods are provided for simulating univariate generalized inverse Gaussian (GIG) random variables and matrix generalized inverse Gaussian (MGIG) random matrices; both are needed inside the MCMC scheme.
4. **Performance Evaluation**: Simulation studies and real-data analysis show parameter recovery and clustering results that are competitive with other clustering approaches.

### Key Technologies

- **Multivariate Normal-inverse Gaussian (MNIG) Distribution**: A mean-variance mixture of a multivariate normal distribution with an inverse Gaussian mixing distribution, suitable for modeling skewed and heavy-tailed data.
- **Bayesian Method**: Priors are placed on the model parameters, and inference is based on their posterior distributions.
- **Gibbs Sampling**: A Markov chain Monte Carlo (MCMC) method used to sample from complex posterior distributions.
- **Model Selection**: The Bayesian Information Criterion (BIC) is used to select the number of clusters.

### Simulation Studies and Real-Data Analysis

- **Simulation Studies**: Two-dimensional and four-dimensional data sets with skewness and heavy tails are generated to assess clustering performance. The results show that the proposed method recovers the parameters accurately and attains a relatively high Adjusted Rand Index (ARI).
- **Real-Data Analysis**: The proposed method is applied to the Old Faithful data set and the Fish Catch data set to show how it performs on practical problems.

### Conclusion

The paper proposes a mixture model based on the multivariate normal-inverse Gaussian distribution, together with a Bayesian parameter estimation method, that gives effective clustering results for skewed and heavy-tailed data. Simulation studies and real-data analysis demonstrate the effectiveness and competitiveness of the proposed method.
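
To make the mean-variance mixture construction of the MNIG distribution concrete, here is a minimal sampling sketch. It assumes one common parameterization that this summary does not state: X = mu + U * beta + sqrt(U) * Sigma^(1/2) * Z, where Z is standard normal and U is inverse Gaussian with density proportional to u^(-3/2) exp(-(delta^2 / u + gamma^2 * u) / 2). The function name `sample_mnig` and all parameter names are hypothetical, and the paper's exact parameterization and constraints may differ.

```python
import numpy as np


def sample_mnig(n, mu, beta, sigma, delta, gamma, rng):
    """Draw n points from an MNIG-type mean-variance mixture (illustrative only).

    Assumed parameterization (not taken from the paper):
        U ~ inverse Gaussian with mean delta / gamma and shape delta**2
        X | U = u ~ N(mu + u * beta, u * sigma)
    """
    mu = np.asarray(mu, dtype=float)
    beta = np.asarray(beta, dtype=float)
    sigma = np.asarray(sigma, dtype=float)

    # NumPy's Wald distribution is the inverse Gaussian, parameterized by mean and shape.
    u = rng.wald(mean=delta / gamma, scale=delta**2, size=n)

    # Correlated Gaussian noise via the Cholesky factor of sigma.
    chol = np.linalg.cholesky(sigma)
    z = rng.standard_normal((n, mu.shape[0]))

    # The latent u shifts the mean along beta (skewness) and scales the covariance (heavy tails).
    x = mu + u[:, None] * beta + np.sqrt(u)[:, None] * (z @ chol.T)
    return x, u


rng = np.random.default_rng(0)
mu = [0.0, 0.0]
beta = [1.0, -0.5]
sigma = [[1.0, 0.3], [0.3, 1.0]]
x, u = sample_mnig(200_000, mu, beta, sigma, delta=1.0, gamma=1.0, rng=rng)

# Under this parameterization E[X] = mu + (delta / gamma) * beta.
print(x.mean(axis=0))
```

Setting `beta` to zero yields a symmetric but still heavy-tailed component, which is how a single family can cover both skewed and symmetric clusters. In a Gibbs sampler the latent `u` values are treated as unknowns and updated from their conditional distribution, which is where generalized inverse Gaussian simulation enters.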