Semisupervised inference for explained variance in high dimensional linear regression and its applications

T. Tony Cai, Zijian Guo
DOI: https://doi.org/10.1111/rssb.12357
2020-01-20
Abstract: The paper considers statistical inference for the explained variance βᵀΣβ under the high dimensional linear model Y = Xβ + ε in the semisupervised setting, where β is the regression vector and Σ is the design covariance matrix. A calibrated estimator, which efficiently integrates both labelled and unlabelled data, is proposed. It is shown that the estimator achieves the minimax optimal rate of convergence in the general semisupervised framework. The optimality result characterizes how the unlabelled data contribute to the estimation accuracy. Moreover, the limiting distribution for the proposed estimator is established, and the unlabelled data are also shown to reduce the length of the confidence interval for the explained variance. The method proposed is extended to semisupervised inference for the unweighted quadratic functional ‖β‖₂². The inference results obtained are then applied to a range of high dimensional statistical problems, including signal detection and global testing, prediction accuracy evaluation and confidence ball construction. The numerical improvement from incorporating the unlabelled data is demonstrated through simulation studies and an analysis of estimating heritability for a yeast segregant data set with multiple traits.
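To make the target quantity concrete, the following is a minimal sketch of the explained variance, taken here to be the quadratic form βᵀΣβ, evaluated on a toy example. It is not the calibrated estimator proposed in the paper: β is treated as known (in practice it must be estimated), the covariates are assumed to be mean zero with identity covariance, and the function names `explained_variance` and `sample_covariance` are hypothetical. The sketch only illustrates that unlabelled covariates carry information about Σ, so pooling them with the labelled covariates gives a larger sample for the covariance estimate.

```python
import random


def explained_variance(beta, sigma):
    """Return the quadratic form beta^T Sigma beta."""
    p = len(beta)
    return sum(beta[i] * sigma[i][j] * beta[j]
               for i in range(p) for j in range(p))


def sample_covariance(rows):
    """Sample second-moment matrix, assuming mean-zero covariates."""
    n = len(rows)
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(p)]
            for i in range(p)]


# Toy setting (all values are illustrative assumptions, not from the paper).
rng = random.Random(0)
beta = [1.0, -1.0, 0.5]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]


def draw_covariates(n):
    """Draw n covariate vectors with independent standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in beta] for _ in range(n)]


labelled = draw_covariates(50)       # covariates that come with responses
unlabelled = draw_covariates(5000)   # covariates without responses

q_true = explained_variance(beta, identity)
q_labelled = explained_variance(beta, sample_covariance(labelled))
q_pooled = explained_variance(beta, sample_covariance(labelled + unlabelled))

print("true:", q_true)
print("labelled covariates only:", q_labelled)
print("labelled + unlabelled covariates:", q_pooled)
```

In this toy setting the pooled estimate of Σ is based on 5050 covariate vectors rather than 50, which is the intuition behind using unlabelled data; the paper's estimator additionally has to handle an unknown, high dimensional β.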
statistics & probability