Abstract: E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for "stratified" multiple hypothesis testing problems, controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which FDRs are greatly underestimated due to weaknesses in random sequence models. For the remaining families, whose noise behaves as expected, lFDR thresholds outperform q-values, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, motif scanning, and multi-microarray analyses.
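To make the idea of stratifying tests by domain family concrete, the following is a minimal sketch, not the FDR-estimating algorithms developed in the paper. It assumes each hit already carries a p-value and a family label, and it swaps in the standard Benjamini-Hochberg step-up estimate (null proportion fixed at 1) to compute q-values separately within each stratum. The function names `bh_qvalues` and `stratified_qvalues` and the family labels are hypothetical.

```python
from collections import defaultdict


def bh_qvalues(pvalues):
    """Benjamini-Hochberg q-values for one stratum.

    Assumption: the null proportion is taken to be 1, which gives a
    conservative estimate; this is a stand-in, not the paper's estimator.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each q-value is the minimum
    # of p * m / rank over the current rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def stratified_qvalues(hits):
    """Compute q-values per stratum.

    hits: list of (stratum_label, pvalue) pairs, e.g. one per domain match.
    Returns q-values in the same order as the input.
    """
    by_stratum = defaultdict(list)
    for idx, (stratum, _) in enumerate(hits):
        by_stratum[stratum].append(idx)

    out = [None] * len(hits)
    for indices in by_stratum.values():
        qs = bh_qvalues([hits[i][1] for i in indices])
        for i, q in zip(indices, qs):
            out[i] = q
    return out


# Hypothetical example: three hits for one family, one hit for another.
example_hits = [("famA", 0.01), ("famA", 0.02), ("famA", 0.5), ("famB", 0.04)]
example_q = stratified_qvalues(example_hits)
```

Because each family is corrected only against its own hits, a family with few candidate matches is not penalized by the number of tests performed for other families. A single threshold applied to the resulting q-values then plays the role that a single E-value threshold plays in the unstratified setting.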

Empirical Null Estimation using Discrete Mixture Distributions and its Application to Protein Domain Data

Beyond the E-value: stratified statistics for protein domain prediction

Estimating the null distribution for conditional inference and genome-scale screening

Quantification of the effect of mutations using a global probability model of natural sequence variation

False Discovery Rate Controlling Procedures with BLOSUM62 substitution matrix and their application to HIV Data

Bagged Empirical Null p-values: A Method to Account for Model Uncertainty in Large Scale Inference

Optimal False Discovery Rate Control for Large Scale Multiple Testing with Auxiliary Information

Double truncation method for controlling local false discovery rate in case of spiky null

A Gene Selection Method for GeneChip Array Data with Small Sample Sizes

Protein Discovery with Discrete Walk-Jump Sampling

Estimating the number and effect sizes of non-null hypotheses

Assessment of false discovery rate control in tandem mass spectrometry analysis using entrapment

Large-Scale Simultaneous Testing Using Kernel Density Estimation

Optimal rates of convergence for estimating the null density and proportion of nonnull effects in large-scale multiple testing

MixTwice: large-scale hypothesis testing for peptide arrays by variance mixing

Isolating selective from non-selective forces using site frequency ratios

Exploratory data analysis for large-scale multiple testing problems and its application in gene expression studies

Nonparametric Bayes multiresolution testing for high-dimensional rare events

Data-Error Scaling in Machine Learning on Natural Discrete Combinatorial Mutation-prone Sets: Case Studies on Peptides and Small Molecules

Multiple Testing with Heterogeneous Multinomial Distributions

Accurate and Fast Small P-Value Estimation for Permutation Tests in High-Throughput Genomic Data Analysis with the Cross-Entropy Method