Abstract: It is well established that phenotype and genotype misclassification errors reduce the power to detect genetic association. One way to address this issue is double-sampling: re-measuring the genotype and/or phenotype of a subset of the data with a gold-standard method. We derive the non-centrality parameter (NCP) for the recently published Likelihood Ratio Test Allowing for Error (LRTae) in the presence of random phenotype and genotype errors. With the NCP, power and sample size can be determined analytically at any significance level. We verify analytic power with simulations using a 2^k factorial design with high and low settings of: case and control genotype frequencies, phenotype and genotype misclassification probabilities, total sample size, ratio of cases to controls, and proportions of phenotype and/or genotype double-samples. We also present example applications of our method that assume equal total costs for the LRTae method and the standard method that does not use double-sample information (LRTstd), to determine whether the power gained by double-sampling a proportion of samples outweighs the reduction in sample size caused by the additional cost of obtaining double-samples. Across the factorial design settings, the median difference between analytic and simulated power was at most 0.01, and the maximum difference was 0.054. In our cost/benefit analyses with genotype errors, double-sampling appears most beneficial (in terms of power gain) when the cost of double-sampling is relatively low, irrespective of the proportion of individuals double-sampled. In the presence of phenotype error, the LRTae method always yields a power gain for the parameter settings considered. We provide freely available software that performs power and sample size calculations for the LRTae method, as well as cost/benefit analyses comparing the power of the LRTae and LRTstd methods assuming equal costs.
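To illustrate how an NCP translates into power and sample size, the following is a minimal sketch, not the authors' software. It assumes that the test statistic follows a central chi-square distribution under the null hypothesis and a non-central chi-square distribution with the given NCP under the alternative, and that the NCP grows linearly with the total sample size. The function names, the degrees of freedom, and the per-individual NCP value are hypothetical; the abstract does not give the LRTae NCP formula, so the NCP is treated here as a supplied input.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x), by power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-16 * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    return min(1.0, total * math.exp(a * math.log(x) - x - math.lgamma(a)))


def chi2_cdf(x, df):
    """CDF of the central chi-square distribution with df degrees of freedom."""
    return reg_lower_gamma(df / 2.0, x / 2.0)


def chi2_critical_value(df, alpha):
    """Upper-alpha critical value of the central chi-square, by bisection."""
    target = 1.0 - alpha
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < target:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def noncentral_chi2_cdf(x, df, ncp):
    """Non-central chi-square CDF as a Poisson mixture of central CDFs.

    Suitable for moderate NCP values; exp(-ncp / 2) underflows for very
    large NCPs.
    """
    half = ncp / 2.0
    weight = math.exp(-half)  # Poisson(half) probability of j = 0
    cdf = 0.0
    j = 0
    while True:
        cdf += weight * reg_lower_gamma(df / 2.0 + j, x / 2.0)
        j += 1
        weight *= half / j
        if j > half and weight < 1e-16:
            break
    return cdf


def power_from_ncp(ncp, df, alpha):
    """Asymptotic power of a chi-square test with the given NCP."""
    crit = chi2_critical_value(df, alpha)
    return 1.0 - noncentral_chi2_cdf(crit, df, ncp)


def required_sample_size(ncp_per_individual, df, alpha, target_power):
    """Smallest N reaching target_power, ASSUMING NCP = N * ncp_per_individual."""
    n = 1
    while power_from_ncp(n * ncp_per_individual, df, alpha) < target_power:
        n *= 2
    lo, hi = n // 2, n  # lo is too small (or 0); hi is large enough
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if power_from_ncp(mid * ncp_per_individual, df, alpha) < target_power:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Hypothetical inputs chosen only for illustration.
    print(power_from_ncp(ncp=10.0, df=2, alpha=0.05))
    print(required_sample_size(0.01, df=2, alpha=0.05, target_power=0.80))
```

Because the significance level enters only through the critical value, the same functions apply at any significance level, provided the chi-square approximation holds.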

The hidden factor: accounting for covariate effects in power and sample size computation for a binary trait

Sample Size and Power Calculations with Correlated Binary Data

Detecting latent gene-environment interaction when analyzing binary traits

Power Analysis of Principal Components Regression in Genetic Association Studies

Correlation of Population Parameters Leading to Power Differences in Association Studies with Population Stratification

Computing Asymptotic Power and Sample Size for Case-Control Genetic Association Studies in the Presence of Phenotype and/or Genotype Misclassification Errors

Sample size and optimal design for logistic regression with binary interaction

Linkage Analysis of Longitudinal Data and Design Consideration

Regression Analysis of Dependent Binary Data for Estimating Disease Etiology from Case-Control Studies

Gene-Based Association Analysis for Censored Traits Via Fixed Effect Functional Regressions

Impact of genotyping errors on statistical power of association tests in genomic analyses: A case study

A Comparison of Approaches to Control for Confounding Factors by Regression Models

A Penalization Method for Estimating Heterogeneous Covariate Effects in Cancer Genomic Data

A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology

A Varying Coefficient Model to Jointly Test Genetic and Gene–Environment Interaction Effects

Sample size determination for logistic regression revisited

Efficient Algorithms for Covariate Analysis with Dynamic Data Using Nonlinear Mixed-Effects Model

Univariate/Multivariate Genome-Wide Association Scans Using Data from Families and Unrelated Samples

Accounting for unobserved covariates with varying degrees of estimability in high dimensional biological data

Interpretation of two-sample Mendelian randomization for binary exposures and outcome

A Regression-based Approach to Robust Estimation and Inference for Genetic Covariance