Abstract: The goal of eQTL studies is to identify the genetic variants that influence the expression levels of the genes in an organism. High-throughput technology has made such studies possible: in a given tissue sample, it enables us to quantify the expression levels of approximately 20,000 genes and to record the alleles present at millions of genetic polymorphisms. While obtaining this data is relatively cheap once a specimen is at hand, obtaining human tissue remains a costly endeavor. Thus, eQTL studies continue to be based on relatively small sample sizes, a limitation that is particularly serious for the tissues of most immediate medical relevance. Given the high-dimensional nature of these datasets and the large number of hypotheses tested, the scientific community adopted multiplicity adjustment procedures early on; these primarily control the false discovery rate for the identification of genetic variants that influence expression levels. In contrast, a problem that has not received much attention to date is that of providing estimates of the effect sizes associated with these variants in a way that accounts for the considerable amount of selection. We illustrate how the recently developed conditional inference approach can be deployed to obtain confidence intervals for the eQTL effect sizes with reliable coverage. The procedure we propose is based on a randomized hierarchical strategy that both reflects the steps typically adopted in state-of-the-art investigations and introduces the use of randomness instead of data splitting to maximize the use of available data. Analysis of the GTEx Liver dataset (v6) suggests that naively obtained confidence intervals would likely not cover the true values of effect sizes and that the number of local genetic polymorphisms influencing the expression level of genes might be underestimated.

Estimating the null distribution for conditional inference and genome-scale screening

Bayesian Updating and Sequential Testing: Overcoming Inferential Limitations of Screening Tests

Error-rate and decision-theoretic methods of multiple testing: Which genes have high objective probabilities of differential expression?

A generalized correlated binomial distribution with application in multiple testing problems

Empirical partially Bayes multiple testing and compound $χ^2$ decisions

Estimating the number and effect sizes of non-null hypotheses

Optimal rates of convergence for estimating the null density and proportion of nonnull effects in large-scale multiple testing

Bagged Empirical Null p-values: A Method to Account for Model Uncertainty in Large Scale Inference

Interval estimation, point estimation, and null hypothesis significance testing calibrated by an estimated posterior probability of the null hypothesis

A Gene Selection Method for GeneChip Array Data with Small Sample Sizes

Smaller $p$-values in genomics studies using distilled historical information

Selection-adjusted inference: an application to confidence intervals for cis-eQTL effect sizes

Exact conditional p-values from arbitrary ranking of a sample space: An application to genome-wide association studies

Consistent estimation of the proportion of false nulls and FDR for adaptive multiple testing Normal means under weak dependence

Resolving conflicts between statistical methods by probability combination: Application to empirical Bayes analyses of genomic data

An unconditional exact test for the Hardy-Weinberg equilibrium law: sample-space ordering using the Bayes factor

Two New Estimators for the Proportion of True Null Hypotheses in Multiple Test

Testing a Large Number of Composite Null Hypotheses Using Conditionally Symmetric Multidimensional Gaussian Mixtures in Genome-Wide Studies

The Choice of Null Distributions for Detecting Gene-Gene Interactions in Genome-Wide Association Studies

Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses

A Change-Point Approach to Estimating the Proportion of False Null Hypotheses in Multiple Testing