Optimizing Variable Selection in Phylogenetic Eigenvector Regression for Trait Correlation Analysis

Zheng-Lin Chen, Deng-Ke Niu
DOI: https://doi.org/10.1101/2024.04.14.589420
2024-11-05
Abstract: Phylogenetic autocorrelation refers to the similarity in traits among closely related species due to shared evolutionary history, which violates the assumption of independence in conventional statistical analyses. Phylogenetic eigenvector regression (PVR) addresses this issue by incorporating phylogenetic eigenvectors derived from phylogenetic trees to control for shared ancestry and isolate trait correlations. In this study, we investigated how different selections of dependent and independent variables affect eigenvector identification and trait correlation results. We simulated data on both fixed-balanced and random phylogenetic trees, which allowed us to record internal node data. This enabled us to determine the true correlation between traits using traditional Pearson and Spearman rank correlations, providing a gold standard for comparison. We found that the selected eigenvectors differed significantly depending on which of the two traits was treated as the dependent variable, which led to inconsistent PVR correlation results. We assessed three criteria (R², Pagel's λ, and AIC) to determine which choice of dependent variable produced more accurate results. Our results indicate that, under scenarios with significant phylogenetic autocorrelation and larger sample sizes (128 and 1024 species), the criterion of maximizing R² consistently provided the more accurate model. This held across multiple evolutionary models, tree structures, and regression estimators, including L2 (OLS), MM, and L1. However, in smaller samples (16 species), Pagel's λ was more effective as a selection criterion. These findings emphasize the importance of using appropriate criteria, such as R² or λ, for selecting dependent variables to enhance the accuracy and consistency of evolutionary inferences.
Additionally, our preliminary findings suggest that similar issues of variable selection may affect other spatial statistical tools, such as spatial eigenvector mapping, conditional autoregressive models, simultaneous autoregressive models, and generalized additive models. These insights extend beyond phylogenetic analyses to other contexts where spatial autocorrelation exists or coexists with phylogenetic autocorrelation. Future work is needed to address how to appropriately select dependent and independent variables in these contexts, which remains a key challenge for understanding evolutionary and ecological relationships.
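The dependence of PVR results on the choice of dependent variable can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that eigenvectors come from a classical principal coordinate analysis of a pairwise phylogenetic distance matrix, that eigenvectors are kept by a hypothetical correlation threshold (`select_eigenvectors`), and that the fit criterion compared across the two directions is the ordinary R² of an OLS fit. The toy tree and trait values are arbitrary and are not simulated under any evolutionary model.

```python
import numpy as np


def phylo_eigenvectors(dist, k):
    """Return the k leading principal-coordinate eigenvectors of a distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gower = -0.5 * centering @ (dist ** 2) @ centering  # double-centred matrix
    eigvals, eigvecs = np.linalg.eigh(gower)
    leading = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, leading]


def select_eigenvectors(trait, vecs, threshold=0.3):
    """Hypothetical rule: keep eigenvectors whose |Pearson r| with the trait exceeds threshold."""
    r = np.array([np.corrcoef(trait, vecs[:, j])[0, 1] for j in range(vecs.shape[1])])
    return np.flatnonzero(np.abs(r) > threshold)


def pvr_fit(dependent, predictor, vecs, threshold=0.3):
    """OLS of dependent ~ predictor + eigenvectors selected for the dependent trait."""
    chosen = select_eigenvectors(dependent, vecs, threshold)
    design = np.column_stack([np.ones(len(dependent)), predictor, vecs[:, chosen]])
    coef, *_ = np.linalg.lstsq(design, dependent, rcond=None)
    residual = dependent - design @ coef
    total = dependent - dependent.mean()
    r2 = 1.0 - (residual @ residual) / (total @ total)
    return {"chosen": chosen, "slope": float(coef[1]), "r2": float(r2)}


def pick_dependent(trait_a, trait_b, vecs, threshold=0.3):
    """Fit both directions and keep the one with the larger R^2 (assumed criterion)."""
    fit_a = pvr_fit(trait_a, trait_b, vecs, threshold)
    fit_b = pvr_fit(trait_b, trait_a, vecs, threshold)
    return ("a", fit_a) if fit_a["r2"] >= fit_b["r2"] else ("b", fit_b)


# Toy balanced tree with 8 tips: distance grows with the depth of the split.
dist = np.array([[2.0 * (i ^ j).bit_length() for j in range(8)] for i in range(8)])

# Arbitrary toy traits, used only to exercise the functions.
rng = np.random.default_rng(0)
trait_a = rng.normal(size=8)
trait_b = 0.5 * trait_a + rng.normal(size=8)

vecs = phylo_eigenvectors(dist, k=3)
label, fit = pick_dependent(trait_a, trait_b, vecs)
print(label, fit["chosen"], fit["slope"], fit["r2"])
```

Because `select_eigenvectors` is applied to the dependent trait only, swapping the roles of the two traits can change which eigenvectors enter the model, and therefore the estimated slope. Replacing the R² comparison in `pick_dependent` with Pagel's λ or AIC would require a phylogenetic model fit that this sketch does not attempt.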
Evolutionary Biology