Abstract:
Background: Developing binary classification rules based on SNP observations has been a major challenge for many modern bioinformatics applications, e.g., predicting the risk of future disease events in complex conditions such as cancer. The small-sample, high-dimensional nature of SNP data, the weak effect of each SNP on the outcome, and highly non-linear SNP interactions are several key factors complicating the analysis. Additionally, SNPs take a finite number of values and may be best understood as ordinal or categorical variables, yet many algorithms treat them as continuous.
Methods: We use the theory of high dimensional model representation (HDMR) to build appropriate low dimensional glass-box models, allowing us to account for the effects of feature interactions. We compute the second order HDMR expansion of the log-likelihood ratio to account for the effects of single SNPs and their pairwise interactions. We propose a regression based approach, called linear approximation for block second order HDMR expansion of categorical observations (LABS-HDMR-CO), to approximate the HDMR coefficients. We show how HDMR can be used to detect pairwise SNP interactions, and propose the fixed pattern test (FPT) to identify statistically significant pairwise interactions.
Results: We apply LABS-HDMR-CO and FPT to synthetically generated HAPGEN2 data as well as to two GWAS cancer datasets. In these examples, LABS-HDMR-CO achieves higher accuracy than several algorithms used for SNP classification, while also taking pairwise interactions into account. Because of the large number of tests performed, FPT declares very few significant interactions in the small-sample GWAS datasets when the false discovery rate (FDR) is bounded by 5%. LABS-HDMR-CO, on the other hand, uses a large number of SNP pairs to improve its prediction accuracy. In the larger HAPGEN2 dataset, FPT declares a larger portion of the SNP pairs used by LABS-HDMR-CO as significant.
Conclusion: LABS-HDMR-CO and FPT are interesting methods for designing prediction rules and detecting pairwise feature interactions in SNP data. Reliably detecting pairwise SNP interactions and exploiting potential interactions to improve prediction accuracy are two different objectives, and these methods address each in turn. Although the large number of potential SNP interactions may result in low detection power, potentially interacting SNP pairs, many of which might be false alarms, can still be used to improve prediction accuracy.
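
For reference, a generic second order HDMR truncation applied to a log-likelihood ratio can be written as below. This is a sketch of the standard HDMR form only, not the exact block formulation used by LABS-HDMR-CO. The notation is assumed here rather than taken from the abstract: $x_i$ is the value of the $i$-th SNP, $y$ is the binary class label, $d$ is the number of SNPs, and $f_0$, $f_i$, $f_{ij}$ are the zeroth, first, and second order component functions.

```latex
\log \frac{p(x \mid y = 1)}{p(x \mid y = 0)}
  \;\approx\; f_0
  + \sum_{i=1}^{d} f_i(x_i)
  + \sum_{1 \le i < j \le d} f_{ij}(x_i, x_j)
```

In this form the first order terms $f_i$ carry the single-SNP effects and the second order terms $f_{ij}$ carry the pairwise interactions; all higher order terms are dropped by the truncation.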
