Abstract: The detection of differentially expressed (DE) genes, that is, genes whose expression levels vary between two or more classes representing different experimental conditions (say, diseases), is one of the most commonly studied problems in bioinformatics. For example, the identification of DE genes between distinct disease phenotypes is an important first step in understanding a disease and developing drugs to treat it. We present a novel approach to the problem of detecting DE genes. It is based on a test statistic formed as a weighted (normalized) cluster-specific contrast in the mixed effects of the mixture model that is first used to cluster the gene profiles into a manageable number of clusters. The key factor in the formation of our test statistic is the use of gene-specific mixed effects in the cluster-specific contrast. Consequently, the (soft) assignment of a given gene to a cluster is not crucial: in addition to class differences between the (estimated) fixed-effects terms for a cluster, gene-specific class differences also contribute to the cluster-specific contributions to the final form of the test statistic. The proposed test statistic can be used where the primary aim is to rank the genes in order of evidence against the null hypothesis of no DE. We also show how a P-value can be calculated for each gene for use in multiple hypothesis testing where the intent is to control the false discovery rate (FDR) at some desired level. Using publicly available and simulated datasets, we show that the proposed contrast-based approach outperforms other methods commonly used for the detection of DE genes, both in a ranking context, with a lower proportion of false discoveries, and in a multiple hypothesis testing context, with higher power for a specified level of the FDR.
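To illustrate how per-gene P-values might be used to control the FDR at a desired level, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure. The abstract does not state which FDR-controlling procedure the authors use, so the choice of Benjamini-Hochberg here is an assumption, as are the function name `benjamini_hochberg` and the example P-values, which are made up for illustration and are not results from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Assumption: Benjamini-Hochberg is used as a generic FDR-controlling
    procedure; the source abstract does not name a specific method.
    """
    m = len(p_values)
    # Indices of the genes sorted by increasing P-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank

    # Reject every hypothesis whose rank is at most the cutoff rank.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical per-gene P-values (illustrative only).
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.27, 0.6]
declared_de = benjamini_hochberg(example_p_values, alpha=0.05)
```

With these illustrative inputs, only the first two genes are declared DE at an FDR level of 0.05. The procedure is step-up: the cutoff is the largest rank that satisfies the inequality, so a gene can be rejected even if a smaller P-value ahead of it failed its own threshold.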

Post-clustering difference testing: Valid inference and practical considerations with applications to ecological and biological data

Selective Inference for Hierarchical Clustering

Testing for a difference in means of a single feature after clustering

Testing for Unobserved Heterogeneity via k-means Clustering

Selective inference for k-means clustering

A Bottom-up Approach to Testing Hypotheses That Have a Branching Tree Dependence Structure, with False Discovery Rate Control

Clusters that are not there: An R tutorial and a Shiny app to quantify a priori inferential risks when using clustering methods

Clustering and Classification of Genetic Data Through U-Statistics

Testing for the appropriate level of clustering in linear regression models

Sparse clusterability: testing for cluster structure in high dimensions

Post-clustering Inference under Dependency

Inference on differences between classes using cluster-specific contrasts of mixed effects

Defining Loci in Restriction-Based Reduced Representation Genomic Data from Nonmodel Species: Sources of Bias and Diagnostics for Optimal Clustering

Comparison of Spectral Clustering, K-clustering and Hierarchical Clustering on E-Nose Datasets: Application to the Recognition of Material Freshness, Adulteration Levels and Pretreatment Approaches for Tomato Juices

Valid Post-clustering Differential Analysis for Single-Cell RNA-Seq

Post-selection estimation and testing following aggregated association tests

Testing Informativeness of Covariate-Induced Group Sizes in Clustered Data

Powerful Significance Testing for Unbalanced Clusters

Reply to Chen et al.: Parametric methods for cluster inference perform worse for two-sided t-tests

Statistical Inference for Cluster Trees

Introduction to biostatistics: Part 4, statistical inference techniques in hypothesis testing