Principal component analysis balancing prediction and approximation accuracy for spatial data

Si Cheng, Magali N. Blanco, Timothy V. Larson, Lianne Sheppard, Adam Szpiro, Ali Shojaie
DOI: https://doi.org/10.48550/arXiv.2408.01662
2024-09-09
Abstract: Dimension reduction is often the first step in statistical modeling or prediction of multivariate spatial data. However, most existing dimension reduction techniques do not account for the spatial correlation between observations and do not take the downstream modeling task into consideration when finding the lower-dimensional representation. We formalize the closeness of approximation to the original data and the utility of lower-dimensional scores for downstream modeling as two complementary, sometimes conflicting, metrics for dimension reduction. We illustrate how existing methodologies fall into this framework and propose a flexible dimension reduction algorithm that achieves the optimal trade-off. We derive a computationally simple form for our algorithm and illustrate its performance through simulation studies, as well as two applications in air pollution modeling and spatial transcriptomics.
Methodology, Computation, Machine Learning
What problem does this paper attempt to address?
The paper addresses dimension reduction as a first step in statistical modeling or prediction of multivariate spatial data. Most existing dimension reduction techniques do not account for the spatial correlation between observations, and they do not consider the downstream modeling task when finding a low-dimensional representation. This raises two criteria that a method should satisfy:

1. **Representability**: How well the low-dimensional components approximate the original data.
2. **Predictability**: Whether the low-dimensional components retain meaningful scientific and spatial information when predicted at unmeasured locations.

The paper proposes a flexible dimension reduction algorithm, Representability and Predictability-based Principal Component Analysis (RapPCA), which optimizes representability and predictability simultaneously. RapPCA balances the two by minimizing a combination of representation error and prediction error. In addition, RapPCA allows the low-dimensional scores to have a complex spatial structure and a non-linear relationship with external covariates.

### Main contributions of the paper

1. **Formalize the trade-off between representability and predictability**: The paper formalizes representability and predictability as two complementary but sometimes conflicting metrics.
2. **Propose the RapPCA algorithm**: The algorithm relaxes the constraint on the low-dimensional scores by introducing a penalty term, so that the scores are close to, but not exactly in, a specific model space. This is how it balances representability against predictability.
3. **Theoretical and empirical verification**: The paper proves the optimality of the RapPCA algorithm through theoretical analysis, and demonstrates its performance through simulation studies and practical applications (air pollution modeling and spatial transcriptomics).

### Key methods and techniques

- **Classical PCA and Predictive PCA**: The paper reviews the basic principles of classical PCA and predictive PCA, pointing out that classical PCA mainly focuses on representability, while predictive PCA gives priority to predictability.
- **RapPCA optimization problem**: RapPCA extracts principal components by solving an optimization problem that combines representation error, prediction error, and regularization terms:

\[ \min_{u, v, \alpha, \beta} f_{\gamma, \lambda_1, \lambda_2}(u, v, \alpha, \beta) := \|Y^{(l)} - uv^\top\|_F^2 + \gamma \|u - (K\alpha + B\beta)\|_2^2 + \lambda_1 \alpha^\top \tilde{K} \alpha + \lambda_2 \beta^\top \tilde{Q} \beta \]

  where \( Y^{(l)} \) is the residual matrix after the \( l \)-th iteration, \( K \) is the kernel matrix, \( B \) is the spline basis function matrix, \( \tilde{K} \) and \( \tilde{Q} \) are matrices with small diagonal terms added to avoid near-singularity problems, and \( \gamma \), \( \lambda_1 \) and \( \lambda_2 \) are tuning parameters. The first term measures representation error, the second measures prediction error weighted by \( \gamma \), and the last two are the regularization terms.
- **Algorithm implementation**: The paper proposes a computationally simple algorithm to solve the above optimization problem and selects the tuning parameters through cross-validation.

### Applications and verification

- **Simulation studies**: Simulation studies in three different scenarios show the advantages of RapPCA over classical PCA and predictive PCA.
- **Practical applications**: Applications in air pollution modeling and spatial transcriptomics further illustrate the effectiveness and practicality of RapPCA.
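
To make the structure of the RapPCA objective given under "Key methods and techniques" more concrete, the following is a minimal sketch that only evaluates the objective for given inputs; it does not perform the minimization and is not the authors' implementation. The function name `rappca_objective`, the helpers `matvec` and `quad_form`, and the toy inputs (identity matrices standing in for the kernel and penalty matrices, a single constant column standing in for the spline basis) are all illustrative assumptions.

```python
from typing import Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def matvec(M: Matrix, x: Vector) -> list:
    """Matrix-vector product M x using plain Python lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def quad_form(x: Vector, M: Matrix) -> float:
    """Quadratic form x^T M x."""
    return sum(x_i * y_i for x_i, y_i in zip(x, matvec(M, x)))


def rappca_objective(Y: Matrix, u: Vector, v: Vector,
                     alpha: Vector, beta: Vector,
                     K: Matrix, B: Matrix,
                     K_tilde: Matrix, Q_tilde: Matrix,
                     gamma: float, lam1: float, lam2: float) -> float:
    """Evaluate the objective term by term (hypothetical helper)."""
    # Representation error: squared Frobenius norm of Y - u v^T.
    representation = sum(
        (Y[i][j] - u[i] * v[j]) ** 2
        for i in range(len(u))
        for j in range(len(v))
    )
    # Prediction error: squared distance between the scores u and the
    # model-space fit K alpha + B beta.
    fitted = [k + b for k, b in zip(matvec(K, alpha), matvec(B, beta))]
    prediction = sum((u_i - f_i) ** 2 for u_i, f_i in zip(u, fitted))
    # Regularization on the kernel and spline coefficients.
    penalty = lam1 * quad_form(alpha, K_tilde) + lam2 * quad_form(beta, Q_tilde)
    return representation + gamma * prediction + penalty


# Toy inputs (assumptions): two locations, two variables, Y exactly rank one.
u = [1.0, 2.0]
v = [3.0, 4.0]
Y = [[u_i * v_j for v_j in v] for u_i in u]
K = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for the kernel matrix
B = [[1.0], [1.0]]             # stand-in for the spline basis matrix

# With alpha = beta = 0 the representation error and the penalties vanish,
# so only gamma * ||u||^2 = 2 * (1 + 4) remains.
value = rappca_objective(Y, u, v, alpha=[0.0, 0.0], beta=[0.0],
                         K=K, B=B, K_tilde=K, Q_tilde=[[1.0]],
                         gamma=2.0, lam1=1.0, lam2=1.0)
print(value)  # 10.0
```

In this sketch, setting `gamma` to zero leaves only the representation error and the penalties, while a large `gamma` forces the scores `u` toward the model space spanned by `K` and `B`. This is one way to read the trade-off that the tuning parameter controls.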
In conclusion, the paper proposes the RapPCA algorithm to address the trade-off between representability and predictability in dimension reduction of multivariate spatial data, providing a more effective tool for subsequent statistical modeling and prediction.