Cross-trait prediction accuracy of high-dimensional ridge-type estimators in genome-wide association studies

Bingxin Zhao, Hongtu Zhu
DOI: https://doi.org/10.48550/arXiv.1911.10142
2019-11-22
Methodology
Abstract: Marginal association summary statistics have attracted great attention in statistical genetics, mainly because the primary results of most genome-wide association studies (GWAS) are produced by marginal screening. In this paper, we study the prediction accuracy of the marginal estimator in dense (or sparsity-free) high-dimensional settings with $(n,p,m) \to \infty$, $m/n \to \gamma \in (0,\infty)$, and $p/n \to \omega \in (0,\infty)$. We consider a general correlation structure among the $p$ features and allow an unknown subset of $m$ of them to be signals. As the marginal estimator can be viewed as a ridge estimator with regularization parameter $\lambda \to \infty$, we further investigate a class of ridge-type estimators in a unifying framework, including the popular best linear unbiased prediction (BLUP) in genetics. We find that the influence of $\lambda$ on out-of-sample prediction accuracy depends heavily on $\omega$. Though selecting an optimal $\lambda$ can be important when $p$ and $n$ are comparable, the out-of-sample $R^2$ of ridge-type estimators becomes near-optimal for any $\lambda \in (0,\infty)$ as $\omega$ increases. For example, when features are independent, the out-of-sample $R^2$ is always bounded from above by $1/\omega$ and is largely invariant to $\lambda$ for large $\omega$ (say, $\omega>5$). We also find that the in-sample $R^2$ follows completely different patterns and depends much more on $\lambda$ than the out-of-sample $R^2$. In practice, our analysis delivers useful messages for genome-wide polygenic risk prediction and for the computation-accuracy trade-off in dense high-dimensional settings. We numerically illustrate our results in simulation studies and a real data example.
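The relationship between the marginal estimator and ridge-type estimators described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes independent standard normal features, a single trait, and arbitrary illustrative values for the sample size, number of signals, and heritability (`n`, `p`, `m`, `h2`), and all function names are hypothetical. It takes the out-of-sample $R^2$ to be the squared correlation between predicted and observed outcomes in an independent test set, which makes it invariant to the scale of the estimator, so a ridge estimator with very large $\lambda$ should give essentially the same $R^2$ as the marginal estimator.

```python
import numpy as np


def simulate(n=300, p=1500, m=750, h2=0.5, n_test=2000, seed=0):
    """Draw training and test data from a dense linear model (illustrative values)."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(p)
    signal_idx = rng.choice(p, size=m, replace=False)
    # m signals, scaled so that the genetic variance is roughly h2
    beta[signal_idx] = rng.normal(0.0, np.sqrt(h2 / m), size=m)

    def draw(k):
        X = rng.standard_normal((k, p))  # independent features (assumption)
        y = X @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=k)
        return X, y

    return draw(n), draw(n_test)


def marginal_estimator(X, y):
    """Marginal screening estimator: one regression slope per feature, up to scale."""
    return X.T @ y / X.shape[0]


def ridge_estimator(X, y, lam):
    """Ridge estimator (X'X + n*lam*I)^{-1} X'y, computed in its dual form.

    The dual form X'(XX' + n*lam*I)^{-1} y only needs an n-by-n solve,
    which is cheaper when p > n.
    """
    n = X.shape[0]
    alpha = np.linalg.solve(X @ X.T + n * lam * np.eye(n), y)
    return X.T @ alpha


def out_of_sample_r2(beta_hat, X_test, y_test):
    """Squared correlation between predictions and outcomes in the test set."""
    return np.corrcoef(X_test @ beta_hat, y_test)[0, 1] ** 2


(X_train, y_train), (X_test, y_test) = simulate()
omega = X_train.shape[1] / X_train.shape[0]  # p / n, here 5

r2_marginal = out_of_sample_r2(marginal_estimator(X_train, y_train), X_test, y_test)
lambdas = [0.1, 1.0, 10.0, 100.0, 1e8]
r2_ridge = {
    lam: out_of_sample_r2(ridge_estimator(X_train, y_train, lam), X_test, y_test)
    for lam in lambdas
}

print(f"omega = p/n = {omega:.1f}, 1/omega = {1.0 / omega:.3f}")
print(f"marginal estimator: out-of-sample R^2 = {r2_marginal:.4f}")
for lam, r2 in r2_ridge.items():
    print(f"ridge, lambda = {lam:g}: out-of-sample R^2 = {r2:.4f}")
```

Changing `p` relative to `n` in `simulate` is one way to explore how sensitive the out-of-sample $R^2$ is to $\lambda$ at different values of $\omega$; a single simulated data set like this one is only suggestive and does not substitute for the asymptotic results.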