Optimal control of false discovery criteria in the two-group model

Ruth Heller, Saharon Rosset
DOI: https://doi.org/10.1111/rssb.12403
2020-05-28
Abstract: The highly influential two-group model in testing a large number of statistical hypotheses assumes that the test statistics are drawn independently from a mixture of a high probability null distribution and a low probability alternative. Optimal control of the marginal false discovery rate (mFDR), in the sense that it provides maximal power (expected true discoveries) subject to mFDR control, is known to be achieved by thresholding the local false discovery rate (locFDR), i.e., the probability of the hypothesis being null given the set of test statistics, with a fixed threshold. We address the challenge of controlling optimally the popular false discovery rate (FDR) or positive FDR (pFDR) rather than mFDR in the general two-group model, which also allows for dependence between the test statistics. These criteria are less conservative than the mFDR criterion, so they make more rejections in expectation. We derive their optimal multiple testing (OMT) policies, which turn out to be thresholding the locFDR with a threshold that is a function of the entire set of statistics. We develop an efficient algorithm for finding these policies, and use it for problems with thousands of hypotheses. We illustrate these procedures on gene expression studies.
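
For reference, the criteria named in the abstract are usually written as follows. The notation is an assumption chosen for illustration and is not necessarily the paper's: $h_i \in \{0,1\}$ indicates whether hypothesis $i$ is non-null, $D_i \in \{0,1\}$ is the decision to reject it, $z$ is the full vector of test statistics, $\pi_0$ is the null probability, and $f_0$, $f_1$ are the null and alternative densities.

$$
\begin{aligned}
V &= \sum_{i=1}^{K} (1-h_i)\,D_i, \qquad R = \sum_{i=1}^{K} D_i, \\
\mathrm{FDR} &= \mathbb{E}\!\left[\frac{V}{\max(R,1)}\right], \qquad
\mathrm{pFDR} = \mathbb{E}\!\left[\frac{V}{R}\,\middle|\, R>0\right], \qquad
\mathrm{mFDR} = \frac{\mathbb{E}[V]}{\mathbb{E}[R]}, \\
\mathrm{locFDR}_i(z) &= \Pr(h_i = 0 \mid z)
 \;=\; \frac{\pi_0 f_0(z_i)}{\pi_0 f_0(z_i) + (1-\pi_0) f_1(z_i)}
 \quad \text{(last equality: independent statistics only).}
\end{aligned}
$$

Under dependence the locFDR of hypothesis $i$ conditions on all of $z$, so the closed form on the right no longer applies.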
Statistics Theory
What problem does this paper attempt to address?
This paper asks how to optimally control the false discovery rate (FDR), the positive false discovery rate (pFDR), or the marginal false discovery rate (mFDR) in large-scale hypothesis testing. Specifically, it designs an optimal multiple testing (OMT) strategy under a more general two-group model, one that allows for dependence between test statistics and for different marginal distributions, so as to maximize the expected number of true discoveries while controlling FDR, pFDR, or mFDR.

### Background and Problem Description

Large-scale inference problems usually require testing hundreds or even thousands of hypotheses simultaneously in order to identify the non-null ones. Such problems are common in medicine, genetics, particle physics, ecology, and psychology. To limit the share of false positives among the discoveries, Benjamini and Hochberg (1995) introduced the false discovery rate (FDR) as an error metric. FDR is more lenient than the traditional family-wise error rate (FWER) and has therefore been widely used in large-scale testing.

### Research Objectives

The main objective of the paper is to develop multiple-testing strategies that optimally control FDR, pFDR, or mFDR under a more general two-group model. These strategies must control the chosen error rate while making as many true discoveries as possible. The paper addresses the following key issues:

1. **Optimal control of FDR and pFDR**: Under the general two-group model, how to design an optimal multiple-testing strategy that controls FDR or pFDR while maximizing the expected number of true discoveries.
2. **Handling of dependency structures**: How to efficiently calculate the local false discovery rate (locFDR) when the test statistics are dependent, and how to design the corresponding optimal strategy.
3. **Algorithm implementation**: How to develop efficient algorithms that can handle thousands of hypotheses in practical applications.

### Main Contributions

1. **Theoretical results**: The paper proves that, under FDR or pFDR control, the optimal multiple-testing strategy thresholds the locFDR with a threshold that is a function of the entire data set. This differs from the earlier result for mFDR control, where the optimal threshold is fixed.
2. **Algorithm development**: An efficient algorithm is proposed for finding the optimal multiple-testing strategy. It is based on a relaxed form of an infinite-dimensional linear program and finds the optimal solution by solving the Euler-Lagrange conditions.
3. **Numerical experiments**: Numerical experiments verify the effectiveness of the proposed optimal strategy. They show that, when the dependence structure is known, exploiting it can significantly improve the power of the test, and that even with thousands of hypotheses the power under FDR and pFDR control is significantly better than under mFDR control.

### Conclusions

The paper provides a theoretical and algorithmic framework for optimally controlling FDR, pFDR, or mFDR under a more general two-group model. The methods are of theoretical interest and of practical value, especially in large-scale inference problems such as gene expression studies.
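
To make the fixed-threshold baseline concrete, here is a minimal Python sketch of locFDR thresholding in the simplest independent two-group setting. It is not the paper's OMT algorithm, whose threshold depends on the whole set of statistics. The Gaussian densities (null N(0, 1), alternative N(mu1, 1)), the values `pi0=0.9` and `mu1=3.0`, the threshold `t=0.2`, and the function names are all assumptions chosen for illustration.

```python
import math


def norm_pdf(z, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at z."""
    return math.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def loc_fdr(z, pi0=0.9, mu1=3.0):
    """P(null | z) under an independent two-group Gaussian mixture (assumed model)."""
    f0 = norm_pdf(z)           # null density, N(0, 1)
    f1 = norm_pdf(z, mu=mu1)   # alternative density, N(mu1, 1)
    return pi0 * f0 / (pi0 * f0 + (1 - pi0) * f1)


def reject_fixed_threshold(zs, t, **model):
    """Reject hypothesis i when its locFDR is at most the fixed threshold t."""
    return [loc_fdr(z, **model) <= t for z in zs]


zs = [-0.5, 0.3, 1.2, 2.8, 4.1]
scores = [loc_fdr(z) for z in zs]
decisions = reject_fixed_threshold(zs, t=0.2)
# decisions -> [False, False, False, True, True]
```

In this assumed model the locFDR decreases as z grows, so a fixed locFDR threshold is equivalent to a fixed cutoff on z. A policy of the kind the paper derives would instead let the cutoff vary with the observed statistics.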