Generalized Pearson correlation squares for capturing mixtures of bivariate linear dependences

Jingyi Jessica Li, Xin Tong, Peter J. Bickel
DOI: https://doi.org/10.48550/arXiv.1811.09965
2020-06-30
Abstract: Motivated by the pressing need to capture complex but interpretable variable relationships in scientific research, we generalize the squared Pearson correlation to capture a mixture of linear dependences between two real-valued random variables, with or without an index variable that specifies the line memberships. We construct generalized Pearson correlation squares by focusing on three aspects: the exchangeability of the two variables, independence from parametric model assumptions, and the availability of population-level parameters. To compute the generalized Pearson correlation square from a sample without line-membership specification, we develop a K-lines clustering algorithm, in which K, the number of lines, can be chosen in a data-adaptive way. Using the population-level generalized Pearson correlation squares we define, we derive the asymptotic distributions of the sample-level statistics to enable efficient statistical inference. Simulation studies verify the theoretical results and compare the power of the generalized Pearson correlation squares with that of other widely used association measures. A gene expression data analysis demonstrates that the generalized Pearson correlation squares capture interpretable gene-gene relationships missed by other measures. We implement the estimation and inference procedures in the R package gR2.
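As a rough illustration of the two ideas named in the abstract, the following Python sketch computes a membership-weighted average of within-line squared Pearson correlations and runs a simple K-lines style clustering. It is not the gR2 implementation. The abstract does not spell out the formula or the algorithm's steps, so the weighting by line proportion, the orthogonal-distance assignment rule, the random initialization, and all function names (`pearson_r2`, `fit_line`, `k_lines`, `generalized_r2`) are assumptions made for illustration only.

```python
import math
import random


def pearson_r2(xs, ys):
    """Squared Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no linear dependence
    return sxy * sxy / (sxx * syy)


def fit_line(points):
    """Orthogonal-regression line, returned as (centroid_x, centroid_y, angle).

    Orthogonal (rather than y-on-x) regression treats x and y symmetrically,
    which is one way to respect the exchangeability of the two variables.
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # principal-axis direction
    return mx, my, theta


def perp_dist(p, line):
    """Perpendicular distance from point p to a line from fit_line."""
    mx, my, theta = line
    return abs(-(p[0] - mx) * math.sin(theta) + (p[1] - my) * math.cos(theta))


def k_lines(points, k, n_iter=50, seed=0):
    """Assumed K-lines loop: alternate line fitting and nearest-line assignment."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]  # assumed random initialization
    for _ in range(n_iter):
        lines = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if len(members) < 2:  # re-seed an empty or singleton cluster
                members = rng.sample(points, 2)
            lines.append(fit_line(members))
        new_labels = [
            min(range(k), key=lambda j: perp_dist(p, lines[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def generalized_r2(points, labels, k):
    """Assumed form: sum over lines of (line proportion) * (within-line R^2)."""
    n = len(points)
    total = 0.0
    for j in range(k):
        members = [p for p, l in zip(points, labels) if l == j]
        if len(members) < 2:
            continue
        xs = [p[0] for p in members]
        ys = [p[1] for p in members]
        total += len(members) / n * pearson_r2(xs, ys)
    return total


# Two crossing lines, y = x and y = -x, with known memberships.
points = [(float(x), float(x)) for x in range(1, 11)] + [
    (float(x), float(-x)) for x in range(1, 11)
]
known_labels = [0] * 10 + [1] * 10

plain_r2 = pearson_r2([p[0] for p in points], [p[1] for p in points])
specified_r2 = generalized_r2(points, known_labels, 2)
estimated_labels = k_lines(points, 2)
unspecified_r2 = generalized_r2(points, estimated_labels, 2)
```

On this toy data the ordinary squared Pearson correlation over all points is zero because the two slopes cancel, while the membership-aware version sees two perfectly linear groups. The clustered version depends on the assumed initialization and may settle in a local optimum, which is one reason a practical implementation would likely use multiple restarts.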
Methodology