Abstract:

Background: Large-scale sequencing of entire genomes has ushered in a new age in biology. One of the next grand challenges is to dissect the cellular networks consisting of many individual functional modules. Defining co-expression networks without ambiguity based on genome-wide microarray data is difficult, and current methods are neither robust nor consistent across different data sets. This is particularly problematic for poorly understood organisms, since little existing biological knowledge can be exploited to determine the threshold that differentiates true correlation from random noise. Random matrix theory (RMT), which has been widely and successfully used in physics, is a powerful approach for distinguishing system-specific, non-random properties embedded in complex systems from random noise. Here, we hypothesize that the universal predictions of RMT are also applicable to biological systems, and that the correlation threshold can be determined by characterizing the correlation matrix of microarray profiles using RMT.

Results: Application of RMT to microarray data of S. oneidensis, E. coli, yeast, A. thaliana, Drosophila, mouse and human indicates that there is a sharp transition in the nearest neighbour spacing distribution (NNSD) of the correlation matrix after gradually removing certain elements inside the matrix. Testing on an in silico modular model has demonstrated that this transition can be used to determine the correlation threshold for revealing modular co-expression networks. The co-expression network derived from yeast cell cycling microarray data is supported by gene annotation. The topological properties of the resulting co-expression network agree well with the general properties of biological networks. Computational evaluations have shown that the RMT approach is sensitive and robust. Furthermore, evaluation on sampled expression data of an in silico modular gene system has shown that under-sampled expression data do not affect the recovery of the gene co-expression network. Moreover, the cellular roles of 215 functionally unknown genes from yeast, E. coli and S. oneidensis are predicted from the gene co-expression networks using the guilt-by-association principle. Many of these predictions are supported by existing information or by our experimental verification, further demonstrating the reliability of this approach for gene function prediction.

Conclusion: Our rigorous analysis of gene expression microarray profiles using RMT has shown that the transition of the NNSD of the correlation matrix of microarray profiles provides a profound theoretical criterion for determining the correlation threshold for identifying gene co-expression networks.
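The thresholding idea described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the "sharp transition" refers to the standard RMT contrast between Wigner-Dyson (correlated) and Poisson (uncorrelated) spacing statistics, and it replaces proper eigenvalue unfolding with a crude normalisation to unit mean spacing. All function names (`threshold_correlation`, `nearest_neighbour_spacings`, `poisson_distance`, `scan_thresholds`) are hypothetical.

```python
import numpy as np


def threshold_correlation(corr, cutoff):
    """Zero out entries whose absolute correlation is below `cutoff`."""
    kept = np.where(np.abs(corr) >= cutoff, corr, 0.0)
    np.fill_diagonal(kept, 1.0)  # every gene stays perfectly correlated with itself
    return kept


def nearest_neighbour_spacings(matrix):
    """Eigenvalue spacings of a symmetric matrix, scaled to unit mean.

    Assumption: dividing by the mean is a crude stand-in for the
    unfolding step that a careful RMT analysis would use.
    """
    eigenvalues = np.linalg.eigvalsh(matrix)  # returned in ascending order
    spacings = np.diff(eigenvalues)
    mean = spacings.mean()
    return spacings / mean if mean > 0 else spacings


def poisson_distance(spacings, bins=20, upper=3.0):
    """Mean squared difference between the spacing histogram and exp(-s)."""
    density, edges = np.histogram(
        spacings, bins=bins, range=(0.0, upper), density=True
    )
    centres = 0.5 * (edges[:-1] + edges[1:])
    return float(np.mean((density - np.exp(-centres)) ** 2))


def scan_thresholds(expression, cutoffs):
    """Return (cutoff, distance-to-Poisson) pairs for a genes x samples array."""
    corr = np.corrcoef(expression)  # rows are treated as genes
    results = []
    for cutoff in cutoffs:
        spacings = nearest_neighbour_spacings(threshold_correlation(corr, cutoff))
        results.append((cutoff, poisson_distance(spacings)))
    return results
```

Under these assumptions, a candidate correlation threshold would be the smallest cutoff at which the distance to the Poisson distribution drops and stays low. A real analysis would need proper unfolding and a statistical test rather than a raw squared difference.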

Machine learning analysis of RB-TnSeq fitness data predicts functional gene modules in Pseudomonas putida KT2440

Classification of bacterial plasmid and chromosome derived sequences using machine learning

Multi-Omics integration can be used to rescue metabolic information for some of the dark region of the Pseudomonas putida proteome

Genome‐Wide Fitness and Genetic Interactions Determined by Tn‐seq, a High‐Throughput Massively Parallel Sequencing Method for Microorganisms

Integrating natural language processing and genome analysis enables accurate bacterial phenotype prediction

Transposon sequencing: A powerful tool for the functional genomic study of food-borne pathogens

Constructing Gene Co-Expression Networks and Predicting Functions of Unknown Genes by Random Matrix Theory

Assembling bacterial puzzles: piecing together functions into microbial pathways

Prediction of prokaryotic transposases from protein features with machine learning approaches

Mutant phenotypes for thousands of bacterial genes of unknown function

Predicting metabolic modules in incomplete bacterial genomes with MetaPathPredict

Dual transposon sequencing (Dual Tn-seq) to probe genome-wide genetic interactions

Identification and characterization of proteins of unknown function (PUFs) in Clostridium thermocellum DSM 1313 strains as potential genetic engineering targets

Transcriptome-guided parsimonious flux analysis improves predictions with metabolic networks in complex environments

PangenomeNet: a pan-genome-based network reveals functional modules on antimicrobial resistome for Escherichia coli strains

MICROPHERRET: MICRObial PHEnotypic tRait ClassifieR using Machine lEarning Techniques

Identification of new genes on a whole genome scale using saturated reporter transposon mutagenesis

Unitig-centered pan-genome machine learning approach for predicting antibiotic resistance and discovering novel resistance genes in bacterial strains

Integrating data and knowledge to identify functional modules of genes: a multilayer approach

Rapid acquisition and model-based analysis of cell-free transcription–translation reactions from nonmodel bacteria

Predicting bacterial fitness in Mycobacterium tuberculosis with transcriptional regulatory network-informed interpretable machine learning