Dylan Clark-Boucher, Jeffrey W. Miller
Abstract: Principal variables analysis (PVA) is a technique for selecting a subset of variables that capture as much of the information in a dataset as possible. Existing approaches for PVA are based on the Pearson correlation matrix, which is not well-suited to describing the relationships between non-Gaussian variables. We propose a generalized approach to PVA enabling the use of different types of correlation, and we explore using Spearman, Gaussian copula, and polychoric correlations as alternatives to Pearson correlation when performing PVA. We compare performance in simulation studies varying the form of the true multivariate distribution over a wide range of possibilities. Our results show that on continuous non-Gaussian data, using generalized PVA with Gaussian copula or Spearman correlations provides a major improvement in performance compared to Pearson. Meanwhile, on ordinal data, generalized PVA with polychoric correlations outperforms the rest by a wide margin. We apply generalized PVA to a dataset of 102 clinical variables measured on individuals with X-linked dystonia parkinsonism (XDP), a rare neurodegenerative disorder, and we find that using different types of correlation yields substantively different sets of principal variables.
What problem does this paper attempt to address?
This paper addresses the limitations of existing Principal Variables Analysis (PVA) methods, which rely on the Pearson correlation matrix, when they are applied to non-Gaussian data. Pearson correlation is not well suited to describing relationships between non-Gaussian variables, particularly when the variables are binary, when the dependence of interest lies in a latent space, or when discretization or other transformations in the measurement process lower the correlations in the observed data. The paper therefore proposes a generalized PVA method that accepts other types of correlation (Spearman, Gaussian copula, and polychoric) to better handle non-Gaussian data, both continuous and ordinal.
### Main Contributions
1. **Proposing the Generalized PVA Method**: The paper proposes a generalized PVA method that replaces the traditional Pearson correlation matrix with other types of correlation matrix (Spearman, Gaussian copula, and polychoric) to improve performance on non-Gaussian data.
2. **Simulation Studies**: Extensive simulation studies compare the performance of the different correlation types under a wide range of data distributions. On continuous non-Gaussian data, Gaussian copula or Spearman correlation gives a major improvement over Pearson; on ordinal data, polychoric correlation outperforms the rest by a wide margin.
3. **Practical Applications**: The generalized PVA method is applied to a dataset of 102 clinical variables measured on individuals with X-linked dystonia parkinsonism (XDP). Different correlation types yield substantively different sets of principal variables.
### Method Overview
- **PVA Algorithm**: Following McCabe's "explained variance" criterion, PVA selects the subset of variables that explains as much of the variance of the remaining variables as possible. Equivalently, it minimizes the trace of the conditional covariance matrix of the unselected variables given the selected ones.
- **Generalized PVA**: Three alternative correlation types are introduced:
  - **Spearman Correlation**: A rank-based correlation coefficient that is invariant to monotonic transformations.
  - **Polychoric Correlation**: Applicable to ordinal variables; it assumes that the observed categories arise from discretizing latent variables that are jointly Gaussian.
  - **Gaussian Copula Correlation**: Describes dependence through a latent Gaussian correlation matrix while allowing arbitrary marginal distributions.
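
The selection step and the choice of correlation matrix can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the forward greedy search is one common way to approximate the trace criterion, and the `"normal_scores"` option (Pearson correlation of rank-based normal scores) is only a simple stand-in for a Gaussian copula correlation estimate, which the paper may compute differently. Polychoric correlation is omitted because it needs a dedicated estimator.

```python
import numpy as np
from scipy import stats


def correlation_matrix(X, kind="pearson"):
    """Estimate a p x p correlation matrix from an n x p data array."""
    X = np.asarray(X, dtype=float)
    if kind == "pearson":
        return np.corrcoef(X, rowvar=False)
    # Column-wise ranks; tied values receive their average rank.
    ranks = stats.rankdata(X, axis=0)
    if kind == "spearman":
        return np.corrcoef(ranks, rowvar=False)
    if kind == "normal_scores":
        # Assumption: approximate the Gaussian copula correlation by mapping
        # ranks to standard normal quantiles and taking Pearson correlation.
        z = stats.norm.ppf(ranks / (X.shape[0] + 1))
        return np.corrcoef(z, rowvar=False)
    raise ValueError(f"unknown kind: {kind}")


def residual_trace(R, selected):
    """Trace of the conditional covariance of the unselected variables
    given the selected ones, computed from the correlation matrix R."""
    sel = list(selected)
    rest = [j for j in range(R.shape[0]) if j not in sel]
    if not sel:
        return float(np.trace(R))
    if not rest:
        return 0.0
    R_ss = R[np.ix_(sel, sel)]
    R_rs = R[np.ix_(rest, sel)]
    R_rr = R[np.ix_(rest, rest)]
    cond = R_rr - R_rs @ np.linalg.solve(R_ss, R_rs.T)
    return float(np.trace(cond))


def greedy_pva(R, k):
    """Forward selection: at each step add the variable that most reduces
    the residual trace. Greedy search is not guaranteed to be optimal."""
    selected = []
    for _ in range(k):
        candidates = [j for j in range(R.shape[0]) if j not in selected]
        best = min(candidates, key=lambda j: residual_trace(R, selected + [j]))
        selected.append(best)
    return selected
```

Under these assumptions, `greedy_pva(correlation_matrix(X, "spearman"), k)` returns the column indices of `k` selected variables. Swapping the `kind` argument is the only change needed to move between correlation types, which is the idea behind the generalized method.
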
### Simulation Studies
- **Data Generation**: Latent variables are drawn from a multivariate Gaussian distribution, and the observed data are obtained by applying monotonic transformations to them. This makes it possible to evaluate how non-Gaussianity affects PVA performance.
- **Performance Evaluation**: Performance is evaluated with two metrics:
  - **Proportion of Ideal Variables Selected**: The overlap between the variables selected by each method and the ideal variables.
  - **Relative Explanatory Efficiency (REE)**: How well the selected variable set explains the variance of the omitted variables, relative to the ideal variable set.
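
A toy version of this setup can be sketched in Python. Everything specific here is an assumption made for illustration: the latent correlation matrix, the transformation `exp(2z)`, the sample size, the use of exhaustive subset search (feasible only for small `p`), and the exact formulas for the two metrics. In particular, the "ideal" set is taken to be the best subset under the true latent correlation matrix, and REE is taken to be the ratio of explained variance, measured under that true matrix, between the chosen set and the ideal set. The paper's definitions may differ in detail.

```python
from itertools import combinations

import numpy as np
from scipy import stats


def explained(R, sel):
    """Variance explained by the variables in `sel` under correlation
    matrix R: the selected variables themselves plus the fitted part of
    every unselected variable (p minus the residual trace)."""
    sel = list(sel)
    rest = [j for j in range(R.shape[0]) if j not in sel]
    R_rs = R[np.ix_(rest, sel)]
    fitted = R_rs @ np.linalg.solve(R[np.ix_(sel, sel)], R_rs.T)
    return len(sel) + float(np.trace(fitted))


def best_subset(R, k):
    """Exhaustive search over all size-k subsets (small p only)."""
    return max(combinations(range(R.shape[0]), k), key=lambda s: explained(R, s))


rng = np.random.default_rng(0)
p, n, k = 8, 500, 3

# Hypothetical latent correlation matrix with a three-factor structure.
A = rng.normal(size=(p, 3))
S = A @ A.T + np.eye(p)
d = np.sqrt(np.diag(S))
R_true = S / np.outer(d, d)

# Latent Gaussian draws, then a monotone transform that skews the marginals.
Z = rng.multivariate_normal(np.zeros(p), R_true, size=n)
X = np.exp(2.0 * Z)

estimates = {
    "pearson": np.corrcoef(X, rowvar=False),
    "spearman": np.corrcoef(stats.rankdata(X, axis=0), rowvar=False),
}

ideal = best_subset(R_true, k)
results = {}
for name, R_hat in estimates.items():
    chosen = best_subset(R_hat, k)
    overlap = len(set(chosen) & set(ideal)) / k
    ree = explained(R_true, chosen) / explained(R_true, ideal)
    results[name] = (overlap, ree)
    print(f"{name}: overlap={overlap:.2f}, REE={ree:.3f}")
```

Because the ideal set maximizes explained variance under `R_true`, the REE computed this way cannot exceed 1. A single replicate like this one is noisy; a real comparison would average both metrics over many simulated datasets and transformations.
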
### Results
- **Untransformed Data**: When the data are multivariate Gaussian, the Pearson, Spearman, and Gaussian copula methods perform similarly, all scoring close to 100%.
- **Continuously Transformed Data**: The Spearman and Gaussian copula methods perform better than the Pearson method.
- **Ordinally Transformed Data**: The polychoric correlation method performs best, ahead of all the other methods.
### Practical Applications
- **XDP Dataset**: When generalized PVA is applied to the XDP data, different correlation types yield different sets of principal variables, which helps to better understand the characteristics of the disease.
In conclusion, by proposing the generalized PVA method, this paper addresses the limitations of existing methods in handling non-Gaussian data and provides a more flexible tool for variable selection.